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Significant  Opportunity  for  Revolutionary  Reconfigurable  Processors 

BMDO  applications  often  require  fast  real-time  signal  processing  on  a  space  limited  platform.  The  major 
bottleneck  on  real  time  systems  is  throughput  rate  for  signal  processing  tasks.  Target  detection  and 
recognition  requires  processing  of  images  at  variable  precision  and  at  variable  rates  (depending  on  the 
proximity  of  the  imaging  device  and  the  target)  with  many  of  the  processing  tasks  to  be  done  in  parallel. 
One  can  exploit  this  trade-off  of  precision  and  processing  speed  to  increase  throughput  rate.  The  tool  for 
exploiting  this  trade-off  opportunity  is  configware,  i.e.  the  ability  to  reconfigure  the  internal  processing 
structure  of  FPGAs  (Field  Programmable  Gate  Arrays)  processors  and  FPICs  (Field  Programmable 
Interconnection  Circuits)  networking.  Configware  combines  the  great  flexibility  of  software  and  the 
significant  processing  speed  advantage  of  optimized  hardware. 

Specific  Details 

The  development  of  FPGA  processors  has  allowed  variable  precision  signal  processing  with  variable 
throughput  rates  to  be  done  on  a  single  platform.  The  reduced  precision  is  manifest  in  the  smaller  circuit 
sizes  needed  to  do  computations.  Smaller  circuits  means  more  computation  modules  can  be  fit  onto  a 
single  FPGA  chip,  so  parallelism,  and  therefore  throughput  rate,  is  increased.  Circuit  size  (precision)  and 
parallelism  (throughput  rate)  can  be  traded-off  against  one  another  to  achieve  acceptable  level  of  ones 
while  maximizing  the  other.  Another  method  for  optimizing  signal  processing  tasks  is  to  trade 
complexity  for  accuracy.  This  can  be  done  by  approximating  the  inner  product  operation  which  is 
ubiquitous  in  signal  processing.  Inner  products  are  used  in  adaptive  filtering,  neural  networks,  and 
correlation  matrix  estimation.  Toward  this  end,  nonlinear  estimators  have  been  developed  that  avoid  the 
costliest  computations,  i.e.  multiplication,  in  favor  of  less  costly  computations,  i.e.  addition  and  logical 
operations.  One  task  in  this  study  has  been  to  implement  the  nonlinear  correlation  estimators  on  FPGA 
processors.  The  FPGA  implementation  affords  the  ability  to  trade-off  dynamically  precision  for 
throughput  rate  (via  the  circuit  size/  parallelism  trade-off)  as  well  as  the  ability  to  tradeoff  accuracy  for 
throughput  rate  (via  the  choice  of  nonlinear  correlation  estimator). 

The  reconfiguration  afforded  by  the  FPGA  allows  adaptation  of  the  algorithm  and  precision/parallelism  to 
increase  speed  and  data  processing  bandwidth.  Thus  FPGAs  can  be  hardware  customized  for  algorithm 
and  situation  specifics.  In  contrast  to  DSPs  which  are  reprogrammable,  FPGAs  are  reconfigurable 
hardware,  not  software.  Consequently,  they  can  be  significantly  faster  than  DSP’s  when  multiple  types 
of  tasks  are  involved.  Also,  whereas  ASIC’s  and  DSP’s  have  fixed  input/output  bandwidths,  FPGA  I/O 
pins  can  be  dynamically  reassigned,  so  that  pin  functionality  can  be  traded  off  between  for  instance, 
control  I/O  and  signal  I/O.  Hence  FPGA  has  the  considerable  advantage  of  flexible  I/O  functionality  and 
bandwidth. 

Benefits:  Potentially  “The  Next  Netscape”  Innovation 

For  BMDO,  dynamic  processor  reconfiguration  could  enable  substantial  advancement  in  the  technology 
for  adaptive  target  detection  and  recognition  capability.  Initial  detection  and  long-range  tracking  involves 
operating  on  very  large  data  sets  of  low  precision  numbers.  Target  recognition  and  classification  requires 
more  complex  operations  on  a  smaller  amount  of  high  precision  data.  A  functional  switch  from  detection 
searching  to  lock  on  tracking  to  recognition  and  classification  can  be  done  on  an  FPGA-based 
architecture.  When  the  target  is  far  away,  massive  parallel  processing  of  low  precision  numbers  (e.g.  8 
bit)  using  low  complexity  reduced  accuracy  algorithms  is  carried  out.  Within  a  threat  range,  the  processor 
is  dynamically  reconfigured  to  carry  out  complex  operations  on  high  precision  (e.g.  32  bit)  numbers  using 
a  reduced  number  of  processing  channels.  The  "on-the-fiy”  reconfigurability  of  new  FPGA  architectural 
styles  will  enable  mission  environmental  adaptivity  for  the  ultimate  in  processing  performance.  The 
ability  to  dynamically  reconfigure  the  hardware  architecture,  will  enable  truly  adaptive  computing  and  not 
just  adaptive  coefficient  reprogramming.  This  will  revolutionize  adaptive  computing  and  usher  in  the 
next  most  significant  development  in  computing  since  the  innovation  of  Netscape  and  Internet  Explorer 
type  internet  browser  software. 
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Diagram  2.  Target  Detection,  Tracking,  and  Recognition  at  varying  ranges  require  variable 
precision  and  processing  speed  and  therefore  in-flight  reconfigurability 
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Diagram  1.  Nonlinear  Correlation  Estimators  allow  speed/accuracy  trade-off 
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Nonlinear  Correlation  estimators  have  much  higher  processing  speeds  (3-5  times  faster)  and 
therefore  higher  throughput  rates  than  the  conventional  correlation  estimators 
FPGA  reconfigurability  allows  area  of  correlation  estimator  to  be  traded-off  against 
computation  time  (and  therefore  processing  speed) 
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Executive  Summary 

Effective  missile  defense  requires  computer  systems  with  extraordinary  real-time  computing 
capacity.  Parallel  architectures  are  necessary  to  provide  these  levels  of  performance.  High 
performance  computing  systems  can  be  implemented  by  connecting  large  numbers  of 
conventional  processors,  but  few  algorithms  are  able  to  fully  exploit  the  raw  computational 
resources  in  the  resulting  system. 

Alternative  architectures  can  be  developed  by  integrating  the  design  of  the  numerical  algorithm 
with  the  computing  hardware.  One  such  emerging  technology  is  reconfigurable  computing  based 
on  Field  Programmable  Gate  Arrays  (FPGAs).  Most  conventional  signal  processing  algorithms 
do  not  map  well  to  FPGA  resources,  but  the  same  functionality  can  often  be  provided  with 
algorithms  that  have  been  tailored  to  operate  in  a  FPGA  environment. 

ISL  has  developed  nonlinear  operators  that  have  the  potential  for  implementation  on  FPGAs  and 
can  be  used  to  carry  out  correlation  estimation,  matched  filtering,  adaptive  filtering,  and  neural 
networks.  The  objective  of  Phase  I  of  this  SBIR  was  to  explore  the  possibility  of  doing 
corrrelation  estimation  in  FPGA  processors.  Is  implementation  of  these  estimators  feasible  in 
FPGA  technology?  How  do  the  different  estimators  compare  in  terms  of  performance  measures 
appropriate  to  an  FPGA  implementation?  How  do  the  estimators  compare  in  a  specific 
application  to  each  other?  Can  FPGA  processors  be  connected  in  a  parallel  configuration  to 
afford  more  processing  power? 

In  Phase  I  of  this  SBIR,  we  accomplished  the  following  goals  set  out  in  the  Phase  I  proposal: 

•  Chose  an  FPGA  ‘style’  in  which  to  implement  the  correlation  estimators. 


•  Encoded  the  conventional,  hybrid  sign,  and  quadruplex  estimators  in  a  very  high  level  design 
language  (VHDL)  and  generated  schematics  to  establish  the  feasability  of  VHDL  to  FPGA 
implementation. 

•  Simulated  the  correlation  estimators  and  determined  area/time  tradeoffs  for  correlation 
estimators  for  both  bit-serial  and  bit-parallel  implementations. 

•  Considered  the  relative  computational  complexities  of  the  different  correlation  estimators 
relative  to  one  another. 

•  Developed  a  model  for  and  insights  into  a  multiple  interconnected  FPGA  reconfigurable 
computer  and  studied  potential  applications  of  this  reconfigurable  computer  architecture  to 
applications  of  interest  to  BMDO. 

•  Studied  the  effect  of  correlation  estimation  errors  and  compared  computational  complexity  in 
a  specific  example:  Direction  of  Arrival  Estimation. 
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The  correlation  estimators  we  have  developed  and  implemented  have  a  multitude  of  applications 
for  BMDO.  FPGA-based  reconfigurable  computing  architectures  using  these  operators  could 
significantly  enhance  the  capabilities  of  a  real  time  missile  defense  system  without  requiring  the 
size,  power,  and  expense  of  conventional  parallel  processing  systems.  We  will  discuss  some  of 
these  capabilities  in  the  following  section. 

1  Introduction:  The  Benefits  of  Reconfigurable  Processing  for  BMDO 
1 . 1  Conventional  Processing  Limita  tions 

Historically,  missile-borne  avionics  used  processing  approaches  employing  a  general-purpose 
microprocessor  (GPM)  as  the  computational  unit  of  choice.  In  recent  years,  some  of  the 
functions  requiring  extensive  localized  computing  have  been  committed  to  Digital  Signal 
Processing  (DSP)  processors.  DSP  processors  and  general-purpose  microprocessors  differ 
basically  in  how  they  are  primarily  used.  Microprocessors  typically  aren’t  used  in  direct  signal 
path  real-time  computation  and  need  to  shift  around  their  workloads  before  responding  to  another 
user  command.  Exceptions  include  some  embedded  computer  architectures  with  suitable 
throughput  rates  for  the  application  of  interest.  DSP  processors  are  more  prevalently  geared  for 
processing  signals  that  originate  directly  firom  the  analog-to-digital  signal  source  and  are  usually 
dedicated  to  computing  a  sequence  of  single  tasks  at  a  time.  However,  both  general-purpose 
microprocessors  and  DSPs  are  limited  in  I/O  bandwidth  and  execute  Fixed  Instruction  Sets. 
Although  they  can  be  reprogrammed  to  execute  an  alternative  software  program,  they  do  not 
have  direct  hardware  reconfigurability  and  I/O  flexibility  that  FPGAs  (Field  Programmable 
Gate  Arrays)  do  have.  When  combined  with  FPICs  (Field  Programmable  Interconnects),  a  very 
robust  architecture  can  be  defined.  Alternatively,  Application  or  Algorithm  Specific  Integrated 
Circuits  (ASICs)  can  have  the  best  in  terms  of  low  power,  maximum  functional  density  and 
speed,  but  they  lack  either  reprogrammability  or  reconfigurability.  The  benefits  of  FPGA 
compared  to  traditional  processors  include: 

•  FPGA  implementations  are  potentially  many  times  faster  than  DSP. 

•  FPGA’s  can  handle  variable  precision  computations  whereas  DSP’s  handle  only 
either  fixed  or  full  floating  point. 

•  ASICS  are  not  re-usable  for  other  tasks,  FPGA’s  are  reconfigurable. 

•  Algorithms  and  Interconnect  can  be  dynamically  and  on-the-fly  reconfigured  on 
FPGA/FPIC  processors  which  is  not  the  case  with  fixed  GPM  or  DSP  processor 
boards. 

•  FPGA/FPIC  architectures  can  implement  VLIW  (Very  Long  Instruction  Word) 
computing  for  wide,  concurrent  and  very  fast  processing. 

•  FPGA  functional  density  is  rapidly  increasing,  approaching  million-)-  CLB 
(Combinatoric  Logic  Blocks)  and  can  potentially  implement  any  COTS  Processor 
Architecture.  This  would  enable  backward  compatibility  with  legacy  software 
already  developed  and  enables  the  latest  in  algorithmic  technology  insertion. 
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1.2  Flexible  Field  Programmable  Gate  Array  Implementations 

The  time  to  market  within  the  VLSI  design  cycle  has  rapidly  shrunk  at  a  time  when  system 
complexity  has  intensified  considerably.  These  constraints  and  other  pressures  have  led 
developers  to  seek  methods  of  alleviating  this  dilemma  in  order  to  shorten  the  design  cycle  and 
reduce  costs.  Expeditious  planning  and  development  cycles  has  evolved  into  the  norm  for  just 
remaining  competitive  in  today’s  complex  and  booming  telecommunications  markets.  Field 
programmable  gate  array  (FPGA)  technology  has  been  adopted  as  one  of  the  primary  methods  of 
alleviating  the  situation  at  hand.  Ironically,  the  very  tools  that  are  often  dismissed  today  as 
simply  a  by-product  within  the  application  specific  integrated  circuit  (ASIC)  design  cycle  may 
grow  to  encompass  a  lion’s  share  of  the  end  product  as  well.  Leading  FPGA  manufacturers  are 
planning  to  provide  400,000  gate  FPGAs  with  built-in  features  such  as  DRAM  before  the  end  of 
the  decade. 

The  flexibility  and  quick  response  of  FPGA  technology  to  the  changing  market  scene  has  led  it 
to  be  utilized  to  perform  a  variety  of  different  tasks.  In  the  process,  a  natural  progression  in  terms 
of  usage  has  quietly  been  going  on.  At  first  systems  were  used  primarily  for  their  rapid 
prototyping  capability  in  industry  and  system  emulation  in  academia.  And  this  remains  largely 
the  case  today.  However,  as  more  people  become  aware  of  the  applications  of  configware  this  is 
beginning  to  change.  Furthermore,  as  partial  reconfigurable  FPGAs  with  increasing  circuit 
density  become  available,  then  the  technology  will  progress  from  dynamic  reconfiguration  to  the 
more  demanding  on-the-fly  reconfiguration  necessary  for  next  generation,  smart,  adaptive  and 
evolvable  hardware. 

Configware  represents  a  hybridization  of  the  conventional  usage  of  standard  central  processing 
units  (CPUs)  and  full  custom  ASICs  in  computers.  Configware  platforms  have  numerous 
advantages  over  traditional  forms  of  hardware  and  software  solutions  for  processing  problems. 
Innovative  hardware  platforms,  such  as  a  ReConfigurable  Processor  (RCP),  blur  the  traditional 
boundaries  that  exist  between  hardware  and  software  to  bring  out  the  best  features  of  both 
worlds.  A  system  that  is  blazingly  fast  but  extremely  adaptive  to  its  operating  environment.  No 
longer  is  hardware  strictly  limited  to  application  specific  scenarios.  No  longer  is  software  just  an 
additive  delay,  inhibiting  system  performance.  RCP  Systems  point  the  way  to  tomorrow’s  smart 
machines,  stimulating  new  advances  in  hardware  and  software  technology.  As  a  new  layer 
between  hardware  and  software,  configware  requires  a  novel  programming  methodology  to 
create  itself.  Such  technology  is  referred  to  as  a  very  high-level  description  language  (VHDL), 
software  which  compiles  itself  into  a  virtual  architecture  structure  from  its  binary  eode.  These 
virtual  architectures  are  then  mapped  into  actual  hardware  systems  to  produce  the  earlier 
mentioned  results. 


Field-programmable  gate  arrays  (FPGAs)  are  integrated  circuits  which  incorporate  many 
identical  logic  blocks  on  a  chip,  are  interconnected  on  the  device  as  needed  and  can  be  reused  for 
different  applications.  In  comparison,  a  custom-designed  application-specific  integrated  circuit 
(ASIC)  is  not  reusable  for  other  tasks. 
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The  gate-count  density  has  been  projected  to  reach  2  million  in  the  next  five  years,  therefore 
assume  more  of  the  logic  implementation  on  a  monolithic  processor  without 
requiring  a  significant  amount  of  area.  They  can  provide  parallel  data  processing,  which  can 
render  a  considerable  increase  in  speed  when  compared  to  a  dedicated  DSP  processor  solution. 

Three-dimensional  Inertial  Navigation,  a  significant  BMDO  application,  requires  accurate  sensor 
signal  processing  and  an  adaptive  method  for  assessing  trajectory  errors.  The  sensor  signal 
processing  has  traditionally  been  handled  by  a  custom  application-specific  circuit,  which  would 
convert,  accumulate  and  synchronize  the  sensor-data  into  a  format  usable  for  normal  data 
processing.  If  a  sensor  type  was  modified  or  replaced,  the  ASIC  would  have  to  be  redesigned 
which  requires  a  significant  amount  of  time  in  the  physical  layout  of  the  new  device  and  causes 
potential  schedule  delays  and  increased  cost.  The  FPGA’s  increasing  gate-count  density  makes 
this  technology  a  better  solution  for  the  sensor  signal  interface.  The  speed  of  data  acquisition  can 
be  well  over  10  MHz,  which  is  at  least  two  magnitudes  above  the  resolution  needed  for  accurate 
missile  navigation  and  control.  The  flexibility  of  the  device  affords  reconfiguration  without 
hardware  redesign  when  a  sensor  changes  its  error-model  characteristics  or  is  replaced  by  a 
different  sensor  technology. 

After  the  initial  signal  acquisition,  all  sensors  require  compensation  for  eliminating  any  errors 
caused  by  peculiarities  of  the  device.  Coefficients  from  a  bias  readjustment  up  to  a  third  order 
correction  are  required  to  compensate  for  many  non-ideal  characteristics.  In  addition  thermal 
compensation  must  be  included  since  the  output  of  these  devices  is  extremely  sensitive  to 
temperature.  These  calculations  require  multiplying  the  coefficients  by  the  conditioned  sensor 
data,  then  subtracting  each  product  fi-om  the  sensor  signal.  With  proper  scaling,  these 
compensation  coefficients  and  the  subsequent  error  correction  architecture  can  also  be 
iniplemented  in  FPGAs.  This  technique  preserves  the  parallel  processing  of  the  sensor  data 
initiated  during  the  sensor  signal  processing  and  therefore  provides  an  increase  in  computational 
speed  over  the  typical  implementation  performed  by  either  a  GPP  or  a  DSP.  This  method  also 
reaps  the  benefits  of  selecting  the  best  architecture  for  implementation,  since  it’s  topology  is  very 

similar  to  a  low-pass  Finite  Impulse  Response  (FIR)  filter  which  has  a  multitude  of  possibilities 
in  this  technology. 


Low-pass  FIR  filtering  that  is  commonly  performed  to  attenuate  undesirable  higher  frequency 
errors  caused  by  vehicular  vibration  and  other  environmental  sources  can  also  be  performed  in 
the  FPGA.  The  FPGA  filter  architecture  is  much  more  flexible  than  a  DSP  processor 
implementation  since  the  latter  has  a  fixed  multiplier  architecture.  FPGAs  can  benefit  from 
using  one  of  several  possible  architectures.  Xilinx  Corporation,  a  major  source  for  FPGA 
components,  performed  a  study,  which  compared  their  standard  product  to  the  newest  and  fastest 
DSP  processor  from  Texas  Instruments  (TI).  The  FPGA  implementation  using  XC4000  series 
devices  resulted  in  a  multiplier  up  to  ten  times  as  fast  as  the  TMS320C6X  family  (TI)  DSP 
processor. 

Multiply-accumulate  (MAC)  computations  are  required  in  many  missile  detection,  navigation 
and  control  functions.  For  instance,  the  coordinate  transformations  for  converting  the  sensor 
orientations  from  the  missile  body  alignment  to  the  inertial  frame  coordinates  in  a  strapdown 
inertial  measurement  system  require  several  matrix  manipulations,  which  could  be  accomplished 
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with  FPGAs.  This  implementation  would  leverage  off  of  the  parallel  computation  possibilities, 
and  combine  increased  computational  speed  with  improved  data  synchronization. 


2  Phase  I  Research  Objectives 

2. 1  Problem  Description:  Correlation  Matrix  Estimator  (CME) 

2.1.1  Case  Study:  Advantages  of  FPGA  Implementation 

In  Japan,  the  Nobeyama  radioheliograph  monitors  the  surface  of  the  sun  for  evidence  of  solar 
flares  using  an  array  of  eighty-four  antennas.  Generating  images  of  the  solar  surface  in  real  time 
requires  correlating  3486  distinct  pairings  of  antenna  outputs.  Since  each  antenna  produces  40 
million  complex  samples  per  second,  computing  the  correlations  consumes  just  under  560  billion 
multiply-accumulate  operations  per  second.  At  20  MIPS  each,  nearly  30,000  DSP  chips  would 
be  required  for  this  task! 

The  design  team,  however,  found  a  way  to  solve  the  problem  with  a  relatively  modest  array  of 
231  custom  integrated  circuits.  The  engineers  didn’t  use  an  exotic  technology;  the  custom  ICs 
were  ordin^  CMOS  devices  clocked  at  10  MHz.  Instead,  they  hard-limited  the  data  to  ±1  as 
shown  in  Figure  1  so  that  all  of  the  multipliers  could  be  replaced  by  exclusive-OR  gates  and  all 
of  the  accumulators  could  be  replaced  by  counters.  This  technique  is  also  known  as  the  polarity 
coincidence  method  because  it  counts  the  number  of  times  the  input  sign  bits  agree. 


I  CONVENTIONAL  CORRELATOR 


INPUTS  DIGITIZERS  MULTIPLIER  ACCUMULATOR  OUTPUT 


I  HARD-LIMITING  CORRELATOR 


INPUTS  LIMITERS  XOR  GATE  COUNTER  OUTPUT 

Figure  1.  Conventional ,  F  igure  2.  Hard-limiting  Correlators. 


This  example  illustrates  the 
potential  economic  benefits  of  an 
integrated  approach  to  the  design 
of  complex  digital  signal 
processing  systems.  When  the 
algorithm  and  the  hardware  are 
designed  jointly  and  not  in 
isolation,  the  added  flexibility 
permits  design  tradeoffs  that  can 
reduce  system  costs  by  more  than 
an  order  of  magnitude.  Research 
conducted  at  ISL  on  the  use  of 
nonlinear  processing  elements  for 
DSP  suggests  that  computationally 
intensive  real-time  algorithms  can 
be  reformulated  in  terms  of  simple 
arithmetic  operations  that  are  easily 
implemented  in  digital  hardware. 
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2.1.2  FPGA  Based  DSP 


The  advent  of  the  Field  Programmable  Gate  Array  (FPGA)  provides  an  opportunity  to  exploit 
these  techniques  to  achieve  extraordinary  processing  rates  in  inexpensive  programmable 
hardware.  FPGAs  are  integrated  circuits  that  provide  the  functionality  of  custom  logic  without 
the  initial  cost  or  development  time.  Software  tools  are  used  to  enter  and  then  simulate  a 
hardware  design  that  can  then  be  downloaded  into  the  FPGA  (with  some  FPGAs,  this 
permanently  configures  the  device,  but  others  can  be  reconfigured  with  different  designs  as  the 
need  arises).  Designs  can  be  entered  in  schematic  form  or  they  can  be  coded  in  a  hardware 
description  language  such  as  VHDL  or  Verilog.  The  design  is  “compiled”  into  a  configuration 
file  used  to  program  a  particular  class  of  FPGAs.  Although  compiler  efficiencies  are  improving 
as  this  technology  matures,  a  little  human  interaction  is  usually  necessary  to  meet  tough  design 
requirements. 

technology  is  evolving  rapidly;  the  maximum  available  density  is  currently  making  a 
transition  from  25,000  equivalent  gates”  to  100,000-gate  devices  that  are  currently  available, 
and  250,000-gate  low  power  (2.5V  supply  voltage)  ones  are  due  2Q’98.  Primitive  logic  elements 
such  as  flip-flops  can  be  operated  at  clock  rates  exceeding  100  MHz,  but  inter-element  routing 
delays  generally  limit  maximum  system  clock  rates  to  between  30  and  60  MHz.  Power 
consumption  depends  upon  the  device,  the  function  being  programmed,  and  the  clock  rate. 
Complex,  high-speed  designs  can  easily  dissipate  several  watts.  3.3V  devices  are  available  for 
low-power  applications,  with  the  goal  of  1.8  V  at  .1  micron  feature  size  in  the  next  5  years. 

A  typical  FPGA  is  a  rectangular  array  of  simple  processing  elements  that  communicate  across  a 
grid  of  programmable  interconnects.  The  fundamental  characteristic  that  distinguishes  different 
FPGA  architectures  is  the  tradeoff  between  the  complexity  of  these  processing  elements  and  the 
degree  of  interconnectivity  between  elements.  A  processing  element  might  be  as  simple  as  a 
four-input  multiplexer  and  a  flip-flop,  or  as  complicated  as  four  four-input  look-up  tables  and 
four  flip-flops. 

FPGAs  are  not  a  universal  solution  to  all  DSP  hardware  design  problems.  Programmable  DSP 
chips  are  adequate  for  most  applications,  and  high  performance  ASICs  have  been  developed  for 
many  common  functions  including  FFTs  and  digital  filters.  As  the  example  of  the  Nobeyama 
radioheliograph  shows,  however,  some  real-time  applications  have  computational  requirements 
that  exceed  the  capabilities  of  standard  components.  FPGAs  should  be  considered  in  these 
situations  if  development  budgets  and  schedules  are  tight  or  if  reconfigurability  is  important. 
Array  processing  and  ‘  software  radios”  come  to  mind  as  applications  where  FPGA-based  DSP 
appears  viable. 


The  key  to  using  FPGAs  in  DSP  applications  is  to  reformulate  algorithms  in  terms  of  parallel 
operations  that  are  easily  implemented  with  chains  of  simple  processing  elements.  The 
complexity  of  operations  such  as  addition,  subtraction,  multiplexing,  and  comparison  are  linear 
in  the  number  of  bits  used  to  represent  data  and  are  readily  implemented  in  FPGAs.  The 
complexity  of  a  multiplier,  on  the  other  hand,  is  quadratic  in  the  number  of  bits.  The  advantages 
of  an  FPGA  implementation  are  significantly  diminished  if  the  target  algorithm  uses  multipliers. 
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Since  most  DSP  algorithms  rely  on  multiplication,  this  would  appear  to  be  a  severe  limitation, 
but  a  variety  of  “shortcuts”  are  available  to  avoid  implementing  multipliers.  General-purpose 
multipliers  can  be  eliminated  in  digital  filters  because  the  input  data  are  multiplied  by  known 
constants.  Filters  with  surprisingly  good  characteristics  can  be  designed  with  coefficients  that 
are  powers  of  two  or  sums  of  a  few  powers  of  two.  Multiplication  by  such  coefficients  can  be 
implemented  on  an  FPGA  with  a  small  number  of  fixed  bit-shifts  and  adds.  “Distributed 
arithmetic”  approach  replaces  multipliers  with  look-up  tables  and  adders. 

2.1.3  Nonlinear  Correlation  Multipliers 

The  introductory  example  illustrates  the  use  of  hard  limiters  to  estimate  correlations,  but  there 
are  a  number  of  other  nonlinearities  that  simplify  multiplications  as  shown  in  Table  1.  These 
nonlinearities  are  designed  to  work  with  zero-mean,  equal  variance  Gaussian  data,  but  useful 
results  are  generally  obtained  fi-om  other  zero-mean  distributions. 

Table  1.  Covariance  Estimators. 


Conventional 

_1_ 

N 

N 

Ex,Y, 

/=! 

Hybrid  Sign 

n 

2 

Quadruplex  (See  Table  2) 

n 

2 

The  hybrid  sign  estimator  from  Table  1 
can  be  implemented  as  a  pair  of 
accumulators.  An  experimental  design 
implemented  on  a  Xilinx  4000  series 
FPGA  used  a  pair  of  vendor-supplied 
macros  (a  sixteen-bit  adder/subtracter  and 
a  sixteen-bit  register)  connected  to 

N 

compute  EIy,|.  A  slight  modification  to 

/=i 

N 

this  arrangement  yielded  an  accumulator  design  for  computing  ^sign(Y.)X..  In  operation, 

i=i 

both  accumulators  would  be  cleared  before  processing  a  sequence  of  N  {X,  Y)  pairs.  The 
accumulator  outputs  would  then  be  read  by  a  conventional  programmable  processor  that  would 
perform  the  remaining  scaling  and  (single)  multiplication  to  evaluate  the  estimator.  This 
processor  would  use  the  estimated  covariance  to  perform  some  aspect  of  the  DSP  function. 

A  resulting  implementation  would  use  18  FPGA  processing  elements.  A  subsequent  timing 
analysis  indicates  that  the  design  would  operate  at  a  maximum  clock  rate  of  33.2  MHz.  If  the 


Table  2.  The  quadruplex  transformation. 


-X<Y<X 
-Y<X<  Y 
-X>  Y>X 
-Y>X>  Y 


f(X,Y)  g(X,Y) 


Y 

X 

X 

Y 

-Y 

-X 

-X 

-Y 
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object  is  to  estimate  a  single  covariance  then  the  FPGA  offers  no  significant  advantages  over  a 
conventional  DSP  chip.  On  the  other  hand,  if  the  object  is  to  estimate  a  symmetric  10x10 
covariance  matrix,  a  systolic  array  of  55  pairs  of  accumulators  operating  in  parallel  could  do  the 
job  at  a  rate  equivalent  to  over  1.8  billion  multiply-accumulate  operations  per  second. 
Replicating  the  hybrid  sign  arithmetic  circuitry  55  times  would  require  fewer  than  1000 
processing  elements,  a  quantity  that  could  be  comfortably  fit  onto  a  single  high-density  FPGA. 
Conventional  programmable  DSP  chips,  with  instruction  rates  well  under  100  MHz,  are  poorly 
matched  to  this  task. 

2.1.4  Performance  Tradeoffs 

Different  nonlinear  multipliers  offer  different  tradeoffs  between  performance  and  complexity. 
The  more  complicated  nonlinearities  retain  more  information  and  thus  offer  improved 
performance  (lower  variance).  The  improved  performance,  however,  is  achieved  at  a  cost  of 
additional  hardware.  No  single  technique  is  “best”  for  all  applications. 

The  performance  of  the  hybrid  sign  estimator  is  compared  with  that  of  the  quadruplex  estimator 
in  Figure  3.  The  variance  of  the  each  estimate  is  normalized  by  the  variance  of  the  conventional 
estimator  and  then  shown  as  a  function  of  correlation.  These  curves  are  generated  from 
approximations  that  are  valid  for  large  numbers  of  Gaussian  pairs  {(X.,Y.)}  where  (X;,yJ  is 

independent  of  ^y,Yy)  for  j .  The  hybrid  sign  estimator  provides  the  greatest  computational 

simplicity,  but  the  reliability  of  the  estimate  is  relatively  poor.  Another  shortcoming  is  the 
variation  in  performance  with  correlation;  if  you  need  to  specify  a  fixed  number  of  samples  to 
provide  a  desired  variance,  without  prior  knowledge  of  the  correlation  you  must  assume  the 
worst-case  performance  at  zero  correlation. 

Table  3.  Correlation  estimators. 


Conventional 

N 

Polarity 

Coincidence 

Hybrid  Sign 

2sign(Y,)X, 

Z|Y,I 

/=! 

Quadruplex 

+[Zg(X„Y,)j 
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Figure  3.  Covariance  estimator  variance 
(normalized  by  the  variance  of  the  conventional 
estimator)  for  independent  Gaussian  pairs  as  a 
function  of  correlation  p. 


Correlation 

Figure  4.  Correlation  estimator  variance  as  a  function  of 
correlation  p  for  independent  Gaussian  observations. 


In  some  applications  the  input  signal  level  is  a 
"nuisance  parameter"  that  is  often  eliminated  by  a 
suitable  normalization  that  yields  an  estimate  of 
the  correlation  coefficient  p  €[—1,1].  Corresponding  to  each  covariance  estimator  in  Table  1  is 
an  estimator  of  the  correlation  coefficient  as  shown  in  Table  3,  and  the  differences  in 
performance  between  these  correlation  estimators  are  shown  in  Figure  4.  The  tradeoffs  between 
the  normalized  hybrid  sign  and  quadruplex  estimators  are  similar  to  those  encountered  with  the 
corresponding  covariance  estimators.  The  polarity  coincidence  estimator  offers  extreme 
simplicity  at  a  cost  of  greater  variance  than  either  of  the  other  nonlinear  techniques.  The 
differing  tradeoffs  between  performance  and  complexity  of  the  four  correlation  estimators  are 
illustrated  in  Figure  5.  These  relative  complexities  are  only  representative  and  actual  differences 
between  implementations  will  vary  depending  upon  the  technology  employed  and  the  choice  of 
design  tradeoffs. 


2.1.5  Generalization  to  Complex  Signals 

Many  digital  signal  processing  algorithms  are  designed  to  operate  on  complex  signals  resulting 
from  quadrature  downconversion.  Each  of  the  covariance  estimators  in  Table  1  may  be 
generalized  for  complex  data  Z.  =X.+  7  Y.  ,  W.  =U.-f  / V.  as  shown  in  Table  4. 

In  Figure  6  the  variances  of  the  complex  hybrid  sign  and  complex  quadruplex  covariance 
estimates  are  normalized  by  the  variance  of  the  conventional  estimator  and  then  shown  as  a 
function  of  correlation  for  three  different  values  of  phase.  The  curves  are  generated  according  to 
approximations  that  are  valid  for  large  numbers  of  independent,  complex,  and  proper  Gaussian 
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pairs.  The  performance  of  each  of  the  correlation  estimators  from  Table  5  is  shown  in  Figure  7. 
The  major  difference  between  these  figures  and  the  corresponding  curves  for  real  Gaussian  pairs 
is  that  the  performances  of  the  hybrid  sign  and  polarity  coincidence  estimators  now  exhibit 
variability  with  the  phase  of  the  correlation  as  well  as  with  the  magnitude. 


Figure  5.  Representative  Performance  Tradeoffs. 


The  hybrid  sign  estimator  uses  a  complex  hard-limiter  csign(W,.)=-^=-(sign(U,.)+7sign(V,)) 
that  takes  on  one  of  four  possible  values  {±1  ,±7}.  The  complex  quadruplex  estimator  requires 


twice  as  many  additions  as  the  complex  hybrid  sign  estimator,  but  the  improved  performance 
may  be  worth  the  extra  effort  in  some  applications.  Complex  generalizations  of  the  correlation 
estimators  in  Table  3  are  presented  in  Table  5. 


Table  4.  Covariance  estimators  for  complex  data. 
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clarity  Coincidence 


ybrid  Sign 


+>sin[^^ csign  (W,)csign(Z;) 
Resign*  (W,.)Z, 


^  csign*  (Z,)Z, 

/  =  ! 


2  Zf(X.U,> f(Y,V,.))[fg(X,Y,.)+g(Y,V^)^ 


Zf(X.U,.)+f(Y„Vj  +  Zg(X.Y,)+g(Y„V,) 


2  Zf(Y.U,)-f(X„V,)  Xg(Y.U,)+g(X„V,) 


Zf(Y.U,.)-f(X,V,,)  +  2:8(Y.U,>g(X.V,,) 


1.8 


Complex  Polarity  Coincidence 


0  0.2  0.4 

CORRELATION  MAGNITLFDE 


Figure  6:  Covariance  Variance 


Complex  Hybrid  Sign 


^  Complex  Quadruplex 


0  O.I  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  I 

Magnitude  of  Correlation 

Figure  7:  Correlation  Variance 


Figure  6.  Covariance  estimator  variance  (normalized  by  the  variance  of  the  conventional 
estimator)  for  independent  complex  Gaussian  pairs  as  a  function  of  p.  The  magnitude  of  p  is 
given  by  the  horizontal  axis  and  the  phase  of  the  correlation  is  given  by  the  line  type.  Dotted 
lines:  p  =  Q|.  Long-dashed  lines:  B  =n/S ,  Short-dashed  lines:  9  =  7i/4|. 

Figure  7.  Correlation  estimator  variance  for  independent  complex  Gaussian  pairs  as  a  function 
of  p.  The  magnitude  of  p  is  given  by  the  horizontal  axis  and  the  phase  of  the  correlation  is 
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given  by  the  line  type.  Dotted  lines:  |Q  -  Q 1.  Short  dashed  lines:  6  =  tt/  8  .  Long  dashed  lines: 
|9  =  7i/4 


Solid  line:  Variance  of  the  conventional  estimator. 


2.2  Correlation  Matrix  Estimator  (CME)  Applications 

Estimates  of  covariance  or  correlation  are  seldom  obtained  for  their  own  sake.  More  often  these 
estimators  are  employed  as  an  intermediate  stage  of  processing  to  obtain  some  other  desired 
information.  Potential  applications  are  abundant  and  include  beamforming,  direction-finding, 
spectral  analysis,  analysis  of  cyclostationary  signals,  adaptive  filtering,  matched  filters,  and 
neural  networks.  In  the  sections  that  follow,  computer  simulation  experiments  are  used  to 
illustrate  the  application  of  nonlinear  multipliers  to  these  diverse  signal  processing  topics. _ 


Figure  8.  Overlaid  results  of  10  simulation  runs,  (a)  Conventional  MUSIC,  (b)  Quadruplex  MUSIC,  (c) 
Hybrid  Sign  MUSIC,  (d)  Polarity  Coincidence  MUSIC. 
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2.2.1  Direction  Finding 


The  initial  step  of  many  direction  finding  algorithms  is  to  estimate  the  covariance  between 
samples  collected  from  different  antenna  elements.  This  estimate  may  be  computed  with  any  of 
the  methods  given  in  Table  1.  The  tradeoffs  between  performance  and  complexity  will  be 
illustrated  with  an  example  using  the  MUSIC  algorithm  to  estimate  angles  of  arrival  of  two 
sources  from  measurements  taken  from  a  uniform  linear  array.  The  MUSIC  algorithm 
transforms  the  covariance  function  into  a  function  of  spatial  frequency  with  peaks  near  the 
frequencies  of  the  sources.  This  function  is  commonly  called  a  “spatial  spectrum”  even  though 
the  peak  heights  do  not  directly  correspond  to  power. 

Figure  8  shows  how  nonlinear  correlation  estimators  affect  the  output  of  the  MUSIC  algorithm. 
Two  sources  were  generated  at  normalized  frequencies  of  0.21  and  0.27  with  signal-to-noise 
ratios  of  20  dB  and  17  dB,  respectively.  “Spectra”  from  ten  simulation  runs  are  overlaid  in  each 
graph.  The  conventional  correlator  produces  the  graph  with  the  least  variation  between 
simulation  runs.  The  most  complicated  nonlinearity,  the  quadruplex  correlator,  exhibits  slightly 
more  variation,  followed  by  the  hybrid  sign  and  polarity  coincidence  correlators.  The  RMS 
spatial  frequency  deviations  given  in  Table  6  further  demonstrate  that  the  precision  of  the  angle 
estimate  increases  with  the  complexity  of  the  covariance  estimator. 

Note  that  the  weaker  source  is  much  less  distinct  in  the  graph  generated  with  the  polarity 
coincidence  correlator.  If  the  difference  between  the  amplitudes  of  the  two  sources  is  increased, 
the  peak  associated  with  the  weaker  source  disappears.  This  limited  dynamic  range  is  typical  of 
algorithms  based  on  hard-limiting. 

Table  6.  RMS  spatial  frequency  deviations  of  the  four  algorithms  for  source  at  0.27. 

2.2.2  Stochastic  Gradient  Filters 


In  the  previous  example,  an  algorithm 
explicitly  estimates  covariance  as  an 
intermediate  step  in  the  estimation  of  the 
angles  of  arrival.  Many  adaptive 
algorithms  implicitly  estimate  covariance. 
One  such  algorithm  is  the  well-known 
LMS  or  stochastic  gradient  filter.  A 
diagram  of  an  LMS  adaptive  filter  is 

shown  in  Figure  9. 

Given  an  input  x(«)  =  [x,(n),X2(«^...,x^(«)]  ,  the  problem  is  to  determine  the  filter  weights 
h(n)  =  which  minimize  the  mean  squared  value  of  the  difference 

e{n)  =  d{n)- yin)  between  the  filter  output  X«)  =  («)x(/i)  and  a  reference  signal  dip).  The 
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solution  to  this  least-squares  problem  is  the  set  of  weights  h(Aj)  that  makes  the  error  signal  e(n) 
orthogonal  to  the  inputs  x(n)  so  that  the  average  value  of  e{n)xi{n)  is  zero. 


Figure  9.  Adaptive  filter. 


The  implicit  correlation  estimation  is  revealed  by  recognizing  the  update  recursion  as  nothing 
more  than  a  discrete  integrator.  As  h(«)  approaches  the  optimal  value,  the  correlation  between 
and  decreases  and  the  average  increment  in  the  update  recursion  approaches  zero. 

This  suggests  that  the  update  recursion  might  still  work  when  the  multiplication  of  and 

x.{n)  in  (1)  is  replaced  by  some 
nonlinear  function  / (^(m)  ,  x.  (n)) 

so  long  as  the  average  value  of  the 
function  is  zero  when  e(n)  and 
Xjin)  are  imcorrelated.  There  is  no 
requirement  that  the  average  value 
of  the  function  be  proportional  to 
this  correlation,  but  odd 
monotonically  increasing  functions 
of  correlation  would  appear  to  be 
desirable.  Table  7  lists  several 
candidate  correlation  multipliers 
taken  from  the  covariance  and 
correlation  estimators  in  Tables  1 

,  and  3  (the  signed  maximum 

Figure  10.  LMS  filter  experiment.  multiplier  is  just  f(X,  Y)  from  the 

quadruplex  given  in  Table  2). 

The  stochastic  gradient  algorithm  updates  the  filter  weight  vector  iteratively  according  to 
hi{n  +  \)  =  h,{n)+\ie{n)x,{n)  (1) 

where  p  is  a  constant  which  determines  the  convergence  rate  and  steady-state  error  of  the  filter. 
Increasing  the  value  of  p  provides  more  rapid  convergence  until  too  great  a  value  of  p  is 
reached  and  the  update  algorithm  becomes  unstable.  More  rapid  convergence  must  be  traded  off 
against  higher  levels  of  steady-state  error  caused  by  increasingly  noisy  filter  weights. 
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Table  7.  Nonlinear  correlation  multipliers. 
Filter  Type  )) 


Conventional 

ex,{n) 

Signed  Maximum 

sign(e  x.(n))min([et|x,(«)|) 

Signed  Regressor 

e  sign(x.(n)) 

Signed  Error 

sign(e)x,(«) 

Signed  Product 

sign(e  x,(«)) 

1.05 


The  performance  of  each  of  the 
multipliers  in  Table  7  was 
evaluated  in  a  computer 
experiment  as  shown  in  Figure 
10.  The  input  vector  x(n)  was 
obtained  by  taking  sequential 
samples  from  an  autoregressive 
process  generated  by  passing 
“white”  Gaussian  noise  through 
a  third-order  recursive  filter. 
The  reference  signal  was  taken 
to  be  the  “new”  sample  of  the 
process  so  that  the  adaptive  filter  was  operating  as  a 
one-step  linear  predictor. 

The  evolution  of  the  mean  squared  error  over  time 
is  shown  in  Figure  1 1  for  each  of  the  multipliers  in 
Table  7.  Each  curve  was  generated  by  averaging 
the  results  of  10,000  trials.  The  convergence  factor 
p  was  carefully  adjusted  in  each  case  to  provide  a 
common  value  of  steady-state  error.  This  step  was 
necessary  because  each  update  provides  a  different 
tradeoff  between  convergence  rate  and  steady-state 
error  for  the  same  value  of  p  .  The  results  seem  to 
confirm  that  better  performance  is  available  as  the 
multiplier  becomes  more  complicated. 

2.2.3  Detection 

In  all  of  the  examples  studied  thus  far,  the 
conventional  multiplier  provides  the  best 
perfomiance  and  the  performances  of  the  estimators 
based  on  nonlinearities  are  ranked  in  order  of  the 
complexity  of  the  nonlinearity.  The  other  feature 
common  to  these  examples  is  the  use  of  Gaussian 
(or  nearly  Gaussian)  signals.  Before  any  sweeping 
generalizations  are  made  concerning  the  relative  merits  of  various  correlation  multipliers,  it 
might  be  prudent  to  run  an  experiment  using  something  other  than  Gaussian  noise.  Consider  a 
very  common  statistical  signal  processing  problem  in  communications,  the  detection  of  a  known 
signal  in  additive  Gaussian  noise.  A  simple  version  of  the  problem  is  obtained  when  a  single  bit 
of  information  is  conveyed  by  sending  a  known  signal  sequence  if  the  bit  is  a  "1"  or  sending  the 
same  sequence  inverted  if  the  bit  is  a  "0".  The  optimal  maximum-likelihood  detector  for  equally 
likely  bit  values  is  the  well  known  matched  filter  which  computes  the  inner  product  of  the  known 
sequence  with  the  received  sequence.  The  received  signal  is  modeled  as 

X,  =  Si  +  n,  (2) 


10000 


- OPTIMUM 

- SIGNED  REGRESSOR 

.  SIGNED  MAX 

- SIGNED  ERROR 

- SIGNED  PRODUCT 

-  LMS 

Figure  11.  Dynamic  behavior  of  mean  squared 
prediction  error. 
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under  and  as 

X,  =  -5,  +  (3) 

under  Hj  where  {o,- }  are  independent  and  identically  distributed  Gaussian  random  variables  with 
zero  mean  and  variance  and  is  a  known  signaling  sequence.  The  optimal  test  for  equally 
likely  messages  employs  the  output  of  a  matched  filter 

N-\ 


(4) 

/=0 

as  the  test  statistic  with  the  decision  rule  given  by:  Hq  if  D  >  0  and  choosing  H,  otherwise. 


I 

e 


O.OM 


(a) 


SNR(dB) 


0.000  H 


0.0000 1 J 


0.000001- 


(b) 


SNR  (dB) 


Figure  12.  Probability  of  error  in  Gaussian  (a)  and  Gaussian  mixture  (b)  noise. 
Solid  Line:  Matched  Filter.  Dotted  line:  Quadruplex  Detector.  Dashed  line:  Sample  and 
Sum  Detector.  Long  Dashed  Line:  Weighted  Partial  Decision  Detector. 


Suboptimal  detectors  may  be  designed  using  the  same  principles  that  were  applied  to  the 
problem  of  estimating  correlations.  There  are  two  analogs  of  the  hybrid  sign  estimator. 
Processing  the  received  signal  with  the  nonlinearity  gives  a  special  case  of  the  weighted  partial 
decision  detector  statistic 

N-l 

D=Ys,sign(x,).  (5) 

/=0 

Processing  the  known  signal  with  the  nonlinearity  yields  the  sample-and-sum  detector  statistic 

y-1 

£)=2sign(j.)x,.  (6) 

/=0 

The  detection  analog  of  the  quadruplex  correlation  estimator  is 

D='£f(x,,s,)  (j) 

where  /(•,•)  is  the  signed  maximum  function  from  Table  2. 
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A  computer  simulation  was  used  to  compare  these  three  suboptimal  detectors  with  the  matched 
filter.  A  signal  sequence  of  (0.25,0.5,  0.75,  0.5,  0.25}  was  used  to  perform  10*  simulation  runs. 
The  estimated  probability  of  error  is  plotted  as  a  function  of  signal-to-noise  ratio  in  Figure  12  (a). 
As  expected,  the  matched  filter  delivered  the  best  performance  closely  followed  by  the 
quadruplex  detector  and  then  the  sample-and-sum  detector  with  the  worst  performance  provided 
by  the  weighted  partial  decision  detector. 

The  robustness  of  each  detector  was  then  evaluated  by  altering  the  noise  model  to  a  zero  mean 
Gaussian  mixture  with  a  variance  given  by  with  a  probability  of  0.95  and  lOa^  with  a 
probability  of  0.05.  A  simulation  using  the  same  signal  sequence  and  number  of  trials  with  the 
new  noise  model  produced  the  results  shown  in  Figure  12  (b).  The  matched  filter  is  no  longer 
optimal  for  larger  signal-to-noise  ratios,  where  the  quadruplex  detector  provides  the  best 
performance.  The  weighted  partial  decision  detector,  which  provided  the  worst  performance  in 
Gaussian  noise,  also  out-performs  the  matched  filter  at  the  higher  signal-to-noise  ratios.  The 
sample-and-sum  detector  provides  the  worst  performance  under  these  conditions. 

2.2.4  Neural  Networks 

The  ideas  presented  in  the  previous  two  examples  can  be  combined  to  show  how  neural  networks 
might  be  implemented  without  multipliers.  The  simple  binary  decision  problem  described  by  (1) 

and  (2)  could  be  solved  by  the  single-layer  perceptron 
shown  in  Figure  13  with  a  suitable  training  set  of  received 
samples  correct 

decisions  if (n)€ {0,1}.  The  weights 

are  updated  according  to  the  LMS  algorithm  with  the  first 
weight  corresponding  to  a  bias  term  that  is  not  paired 
with  an  input.  The  perceptron  output  CZ^S'^elO,!}  is 

obtained  by  comparing  the  error  to  a  fixed  threshold  of 

1/2. 

Nonlinear  functions  can  replace  multipliers  in  two  places, 
in  the  weight  update  and  in  the  computation  of  the  inner 

product  In  Table  8  the  conventional 

/=i 

approach  is  compared  with  a  technique  using  the  signed 
maximum  nonlinearity  /(•,•)  from  Table  2.  Both 

algorithms  were  tested  with  a  computer  simulation  of  a 
detection  problem  similar  to  the  one  given  in  the  previous 
section.  The  noise  variance  was  0.1  and  the 
convergence  parameter  p  was  set  to  0.005.  The  weight 
vector  was  initialized  to  zero  and  each  algorithm  was 
Figure  13.  Single-layer  perceptron.  trained  using  500  sets  of  received  vectors  and  correct 

decisions.  The  probability  of  error  as  a  function  of  time 
was  estimated  by  averaging  10,000  simulation  trials. 


19 


The  performance  of  the  two  algorithms  in  the  simulation  experiment  are  compared  in  Figure  14. 
The  quadruplex  perceptron  converges  more  quickly  than  the  conventional  LMS  perceptron,  but 
the  steady-state  error  is  higher.  One  important  difference  between  this  example  and  the  previous 
example  using  the  adaptive  filter  update  is  that  the  steady-state  errors  cannot  generally  be 
equalized  by  using  different  values  for  p  .  The  decision  regions  of  the  conventional  perceptron 
are  separated  by  a  hyperplane,  but  the  quadruplex  perceptron  produces  a  nonlinear  decision 
boundary  which  may  yield  better  or  worse  steady-state  performance  [1,2,3, 4,5,6]. 

Table  8.  Conventional  LMS  and  Quadruplex  perceptrons. 


Conventional  LMS _ Quadruplex 


N 

t(«)  =  ho  (n)  +  Y,  h  (« V,  («) 

/■-I 

- ‘  1 - 

yin)  =  ho{n) + Y  f(hi  («).  ^/(«)) 

e{n)  -  d{n)-y(n) 

ho{n+\)=ho{t^+\xe{n) 

hi  («  + 1)  =  /^  («) + \3.e  (n)x,  («),/  =  !,...,// 

/=! 

e(n)  =  d(n)-y(n) 

ho(n  +  l)=holn)+iiein) 

hi(n  +  l)=  h^{n)+^f(e(n),xXn)), 

Figure  14.  Probability  of  error  trajectory  for  conv  LMS  and  quadruplex  perceptrons. 

Multiplier  shortcuts  allow  FPGAs  to  take  on  computationally  intensive  real-time  tasks  that 
challenge  the  capabilities  of  conventional  processor  chips.  Computing  architectures  based  on 
FPCrAs  and  nonlinear  processing  elements  can  be  used  to  reduce  implementation  costs  with  only 
a  small  sacrifice  of  performance.  The  use  of  a  programmable  architecture  also  reduces 
development  costs  and  allows  the  hardware  to  be  reconfigured  for  different  applications.  The 
designs  are  easily  ported  to  new  devices  with  greater  capabilities,  so  that  keeping  pace  with 
advances  in  chip  fabrication  technology  is  greatly  simplified. 
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3  Results  of  Phase  I  Investigation 

3.1  Task  1:  Mapping  Basic  CME  Processing  Functions  into  FPGAs 

3.1.1  Feasibility  of  VHDL  to  FPGA  Implementation 

Configware  systems  that  utilize  FPGAs  show  particular  promise  for  high  computational  power. 
Efficient  usage  of  such  systems  however  has  remained  elusive  due  to  a  multitude  of  factors. 
Conventional  signal  processing  algorithms  do  not  map  well  to  FPGA  resources.  This  is  because 
their  hierarchical  gate  array  structures  require  specific  mapping  structures  for  parallelization  and 
pipelining.  VHDL  code  (see  appendix)  has  been  developed  for  evaluation  purposes.  The 
prototype  VHDL  code  was  compiled  and  synthesized  via  Mentor  Graphics  using  QuickHDL  and 
Exemplar  s  Leonardo  CAD  toolkit.  Altera  MaxPlus2  and  Xilinx  Foundation  Series  software  was 
also  used  for  R&D  primarily  for  advanced  synthesis  and  routing  placement  into  specific  FPGAs. 
Other  software  has  been  considered  and  may  be  used  in  the  future;  those  by  Cadence  and 
Synopsis  since  they  are  standard  industry  CAD  vendors.  The  feasibility  of  generating  FPGA 
designs  automatically  from  a  high-level  description  of  the  algorithm  has  become  established  in 
the  commercial  industry  using  VHDL  and  Verilog  software  code  formats.  CAD  tools  by  the 
commercial  vendors  listed  above  are  in  good  standing,  although  it  must  be  noted  that  they  are  in 
a  state  of  constant  upgrade  due  to  the  plethora  of  technological  breakthroughs,  (note:  LC  = 
Logic  Components,  ns  =  nanosec  in  Figures  15  and  16). 

Figure  15:  Bit-Serial  FPGA  Implementation  Costs 
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3.1.2  VHDL  Implementations  Comparison  of  CMEs:  Area  vs.  Time  Performance 

Cost  measurements  in  terms  of  logic  cells  utilized  versus  the  delay  time  for  determining  the 
throughput  were  taken  for  serial  and  parallel  implementations.  The  parallel  implementation 
results  and  the  serial  implementation  results  for  the  various  coirelation  estimators  are  shown  (in 
Figs.  15  and  16)  per  data  bit  to  get  an  accurate  comparison.  Obviously,  the  area  consumed  per  bit 
has  decreased  and  the  throughput  rate  per  bit  has  increased  with  parallelization.  The  results  from 
this  study  indicate  that  the  parallel  implementation  of  the  quadruplex  non-linear  estimator  is 
optimal. 


Figure  16:  Bit-Parallel  Implementation  Costs 


3.1.3  FPGA  Styles 

FPGA  technology  is  constantly  evolving  in  accordance  with  Moore's  Law.  Virtex  is  the 
next  generation  of  Xilinx  FPGA  technology;  a  new  series  of  high-performance,  high-density, 
system-level  devices  with  a  revolutionary  new  architecture,  built  with  leading  0.25  micron 
process  technology.  The  Virtex  architecture,  with  its  2.5-volt  supply  voltage,  offers  the  industry's 
first  devices  capable  of  directly  interfacing  beyond  CMOS  and  TTL  logic.  Voltage  supply  levels 
are  an  intrinsic  measure  for  low-power  hybrid  and  configurable  computing.  Other  interesting 
developments  include  the  rapid  reconfiguration  capability  of  the  new  Xilinx  XC6200  series 
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FPGA  chips,  which  will  enable  algorithmic  swapping  capability  for  dynamic  computation. 
Conventional  DSP  processors  currently  have  the  advantage  for  floating  point  processing,  such  as 
the  new  Texas  Instruments  C60  processors.  These  new  developments  are  worth  further  pursuit. 
The  proposed  ReConfigurable  Processor  architecture  resulting  from  Task  3  will  be  an  enabling 
technology  for  BMDO  applications,  but  Phase  2  funding  is  critical  for  its  development. 

3.2  Task  2:  Reduced  Complexity  CMEs  Interconnected  within  Monolithic  FPGA 

In  real-time  array  signal  processing  and  other  advanced  signal  processing  techniques  formulating 
correlation  matrices  are  necessary.  The  conventional  method  for  estimating  the  correlation 
matrix,  is  to  form  a  sum  of  products  from  the  received  sensor  array  data.  This  technique  for 
generating  the  correlation  estimator  is  computationally  expensive  for  large  sample  sizes. 

In  an  effort  to  reduce  the  complexity,  the  statistics  of  the  correlation  matrix  will  be  investigated. 
The  conventional  method  involves  the  expected  value  of  a  product  R  =  E[XY] .  Consider  a 

mapping  of  the  random  variables  X,  Y  such  that  the  expected  value  of  the  product  is  a  constant 
times  the  product  of  the  expected  values,  E[XY]  =  cE[Z]E[W] .  In  the  case  where  X  and  Y  are 
zero  mean  Gaussian  random  variables,  and  using  Bussgang’s  theorem  [1]  simple  transformations 
can  be  created  with  the  goal  of  reducing  the  computational  expense.  Using  a  transformation,  Q  = 
h(Y),  the  correlation  between  X  and  Y,  R^y  can  be  formulated  from  the  correlation  between  X 
and  Q,  the  correlation  between  Y  and  Q  and  the  correlation  between  Y  and  Y  which  are  RqxRqy 
Ryy  respectively  [4]. 


One  such  simplification  is  Q  =  h(y)  =  sign(Y).  The  correlation  of  Y  is  given  by 

R„  =£[r)  =  ^{£{|y|})= 

when  Y  is  zero  mean  Gaussian.  The  correlation  R^yis  now  reduced  to  the  product  to  two, 
simplistic  correlations  Rxq  and  Ryg. 

R  =  R  R 
^  /  2 

So  the  correlation  is  now  of  the  form  E[XY]  =  cE[Z]E[W]  with  Z  =  h(Y)  Y  =  |Y|  and 

W  =  h(Y)  X  =  sign  (Y)  X.  Over  a  finite  random  sample  (X;  Yj)  i  =  1,  2  , . .  .,N  the  correlation’s 
Rxq  and  Ryq  can  be  estimated  by 

(=0 

and 

^yQ 

1=0 

thus  formulating  the  Hybrid  Sign  estimator  [3]. 

An  alternative  approach  to  estimating  the  mean  of  a  product  is  to  use  a  combination  of  variances 
fromulating  the  identity  AT  =  [(X  +  7)^  -  (Y  -  7) '  ]  /  4  [2].  If  X  and  Y  are  zero  mean  Gaussian 
random  variables  then 
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E{{X  +  Yf]  E{{X-Yf}  _  / 
E^{\X  +  Y\)  EWX-Y\}  /2 
therefore 


E{{X +  Yf -E{{X -Yf)  , 

E{XY}  = - -1^ - Ll  =  y^E^^\X  +  Y\}-E^-{\X-Y\} 

In  the  formula  above  E{XY}  can  be  rewritten  as  the  product  of  to  expectations  as  follows: 

E{{X  +  Yf  -E{{X-Yf)  .  ,  ,  ,  ’ 

= ^  =  {|^  +  r|}-£'{|  Y-y|}wherec  =  ^ 


=  cE{\X  +  Y\  +  \X  -Y\}*E{\X  +  Y\-\X  -Y\}  =  cE[W]E[Z] 
Thus  the  quadruplex  transformation  is  defined  as 
Z  =  [\X  +  Y\  +  \X-YWI2  =  m2^{\X\,\Y\) 
and 


W  -  [\X  +  Y  \  -\X  -Y  \]l  2  =  5/g«(AT)  min(|  X  |,|  Y  |) . 

This  quadruplex  transformation  can  be  represented  as  the  following  table 


f(X,Y) 

8(X,Y) 

-X  <=  Y  <= 

X 

Y 

X 

-Y  <=  X  <= 

Y 

X 

Y 

-X  >=  Y  >= 

X 

-Y 

-X 

-Y>=X>= 

Y 

-X 

-Y 

In  the  interest  of  mobile  communicafions  complex  representation  of  both  of  these  correlation 
estimators  are  necessary.  The  covariance  for  a  joint  distribution  for  two  complex  bivariate 
Gaussian  random  variables,  Z  =  X+j  Y  and  W  =  U+jV,  with  means 

E[Z]  =  E[W]  =  0  and  variances  E[ZZ]  =  E[WW]  =  is 
R^^E[ZW]  =  R^+jR^ 


The  complex  hybrid  sign  estimator  replaces  the  sign  nonlinearity  with  a  phase  nonlinearitv 
QMire^)  =  e  Zm 

where  M  is  an  integer  specifying  the  quantization  in  bits  and  ^  €  (0,2;r)  is  a  arbitrary  phase  shift 

and  int(*)  is  the  “nearest  integer”  operation.  By  choosing  M  =  4  and  introducing  a  phase  shift  of 
A  the  sign  function  is  now 


csign  (x  +  iy)  = 


{sign^x)  +  jsigniy)} 


which  can  take  on  one  of  four  values  in  the  set  {1,-1,  i,  -i}  depending  on  the  real  and  imaginary 
parts  of  Z  being  positive  or  negative.  To  simplify  matters  the  calculation  of  the  product 
Wcsign(Z)  can  be  formulated  in  the  following  table. 
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Z  =  X  +  iY, 

W  =  U  +  iV 

Re{Wcsign(Z)} 

Im{Wcsign(Z)} 

X>=0,  Y>= 

0 

U 

V 

X  <=  0  ,  Y  >= 

0 

-V 

u 

X>=0,  Y<= 

0 

V 

-u 

X  <=  0  ,  Y  <= 

0 

-u 

-V 

The  complex  form  for  the  quadruplex  estimator  is  obtained  by  using  the  real  form  of  the 
quadruplex  transformation  and  appliying  it  to  the  real  and  imaginary  parts  of  the  covariance  as 
described  below. 


and  ~ a, b, 

where 


=  ^xu  +  hv  >  =  ^xu  +  hv » 

and 


^xv 


^YU 


y  —  bj^y  +  b] 


^YU 


1  ^  1  ^ 

^RS=-f^Y. m )  and  4s  =  - E g{Ri ,5,- ) 

1=0  ”  ,=o 

where  f,  g  are  the  same  as  in  the  real  quadruplex  transformation. 


The  reduced  complexity  of  the  CMEs  allows  many  of  these  elements  to  be  created  and 
instantiated  into  a  monolithic  FPGA  structure.  They  are  simply  interconnected  in  a  fashion  that 
is  referred  to  as  “systolic”.  This  indicates  that  input  and  outputs  from  one  element  to  another 
flow  in  a  modular,  regular  and  point-to-point  communication  avoiding  broadcasting  and  a 
standard  bus  interconnect  protocol.  An  advantage  to  this  method  is  that  it  is  naturally 
implemented  in  FPGAs  and  can  result  in  fast  pipelined  operations. 

3.3  Task  3:MuUiple  Interconnected  FPGA  ReConfigurable  Processor  (RCP) 
Architecture 

Configware  systems  like  the  proposed  RCP  (ReConfigurable  Processor)  described  in  the  next 
section  are  fundamentally  defined  by  their  ability  to  adapt  or  completely  reconfigure  itself, 
making  them  suitable  for  a  whole  host  of  traditional  adaptive  algorithms  and  also  enabling  them 
to  significantly  reduce  the  actual  amount  of  hardware  required  in  an  overall  system.  The  potential 
for  small  form  factor  makes  configware  suitable  for  embedded  systems  within  today’s  most 
advanced  systems.  Although  current  generation  FPGAs  do  not  have  the  circuit  density  nor  low 
power  advantages  as  do  custom  ASICs  for  small  form  factor  wireless  products,  they  can  still 
significantly  reduce  the  footprint  in  the  other  electronics  in  UAV  (Unmanned 
Autonomous/Airbome  Vehicles)  or  mobile  base  or  ground  stations. 
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3.3.1  Proposed  FPGA  RCP  Core  Architecture  Overview 

The  ReConfigurable  Processor  (RCP)  will  be  designed  as  a  flexible,  high  internal  bandwidth 
application  specific  co-processor  that  fits  into  a  PC  chassis  utilizing  a  Peripheral  Chip  Interface 
(PCI).  Its  primary  advantage  over  other  systems  is  its  ability  to  adapt  to  do  multiple  functions 
efficiently.  It  is  envisioned  that  the  RCP  can  revolutionize  signal  processing  and  data  transfer 
tasks  by  providing  extremely  fast  and  flexible  Digital  Signal  Processing  (DSP)  functionality  for 
equalization,  forward  error  correction,  synchronization,  encryption/decryption, 
compression/decompression,  to  name  a  few.  However  being  a  configware  platform,  the  RCP  is 
able  to  move  beyond  its  original  purpose.  The  flexible  architecture  of  the  RCP  board  allows  it  to 
be  utilized  in  a  variety  of  applications  and  it  is  particularly  suitable  for  any  application  that 
demands  high  computational  power.  The  RCP  can  be  dynamically  reprogrammed  and 
reconfigured  for  a  synchronous  or  asynchronous  series  of  tasks.  In  addition,  the  RCP  is  able  to  do 
parallel  tasks  by  loading  up  multiple  algorithms  and  running  them  at  the  same  time.  For  digital 
signal  processing  alone,  this  opens  up  a  whole  host  of  possibilities.  These  possibilities  range 
from  executing  multiple  DSP  tasks  on  a  single  channel  to  switching  DSP  tasks  among  a  group  of 
heterogeneous  channels.  In  particular,  one  of  the  foremost  design  specifications  for  the  RCP  is  it 
can  allow  adaptive  DSP  algorithms  to  run  in  real-time.  The  block  diagram  on  the  next  page 
shows  the  architectural  design  flow  of  the  processor. 

The  RCP  architecture  layout  consists  of  4  FPGA  processing  elements  (PE0->PE3  PLDs)  that  are 
connected  to  each  other  through  direct  local  buses  and  through  4  field  programmable 
interconnect  ICube  IQX160  PSID  chips.  These  processing  elements  each  have  their  own  fast 
dual  port  32k  x  1 6  RAM  that  can  be  directly  accessed  by  the  PSIDs.  This  flexibility  enables  the 
designer  to  have  multiple  degrees  of  freedom  for  architecture  implementations  since  one  of  the 
problems  that  designers  often  face  when  using  other  advanced  boards  are  architectural 
limitations.  Architectural  limitations  have  traditionally  lead  to  significantly  higher  design  delays 
for  smart  sensors.  Two  more  FPGA  (Pre  &  Post  PLDs)  processing  elements  are  utilized  to  take 
care  of  pre-  and  post-processing  tasks  such  as  data  conversion,  pruning  and  sorting.  These 
processors  are  connected  to  the  PCI  bus  through  4k  x  18  FIFOs.  The  post-processor  is  further 
connected  to  a  1  M  x  32  single  port  RAM  for  data  storage  of  array  calculation  results  and  for 
storing  additional  data  specifically  for  packet  burst  communications.  Dedicated  PCI  control  logic 
is  used  for  programming,  configuration  and  PCI  interface  tasks.  A  global  bus  connects  all  the 
processors  together  to  a  PCI/Configuration  Controller  device.  The  RCP  is  to  be  designed  with 
multiple  easily  accessible  logic  analyzer  test  headers  and  LEDs  to  aid  with  circuit  debugging  and 
for  display  purposes. 

The  next  few  sections  give  descriptions  of  some  of  the  algorithms  that  are  suitable  for  the  RCP. 
The  discussion  starts  with  the  Viterbi,  Reed-Solomon,  and  Turbo  Coding,  which  are  used  in 
communication  systems  for  forward  error  correction  (FEC)  signaling;  followed  by  digital 
beamforming  for  smart  antenna  array  computing;  and  then  flight  telemetry  processing.  Following 
the  discussions  on  FEC,  digital  beamforming  and  flight  telemetry,  the  discussion  delves  into 
advanced  hardware  topics  such  as  rapid  reconfiguration  and  evolvable  hardware  for  smart 
sensors.  These  represent  but  just  a  few  of  the  areas  where  an  RCP  can  utilized  to  create  optimal 
hardware  implementations  of  algorithms  and  furthermore,  to  enable  new  algorithms  to  be 
developed  quickly. 
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RCP  SIGNAL  BLOCK  DIAGRAM 


3.3.2  Forward  Error  Correction  (FEC)  Signal  Coding 
Viterbi  Decoding 

Digital  communications  will  fundamentally  transform  communication  means  &  multiple 
system  interaction  as  it  takes  a  ubiquitous  role  in  all  electronics.  It  is  acknowledged  that 
the  radio  frequency  (RF)  links  in  the  network  will  be  largely  asymmetric,  with  wide 
bandwidth  (video,  audio,  data)  required  on  the  downlink  (basestation  to  remotes),  and 
low  bandwidth  (audio,  keypad)  required  on  the  uplink  (remotes  to  basestation).  Despite  a 
relatively  low  bandwidth,  the  uplink  connection  presents  some  difficult  technical 
challenges.  FCC,  health  and  safety  regulations,  and  available  battery  reserves  limit  the 
transmitter  power  levels  to  less  than  600  mW  for  a  handheld  device,  and  practical 
considerations  limit  the  transmitting  antenna  size  to  a  small  fraction  of  a  meter.  Since 
low  signal  to  noise  ratio  (SNR)  links  either  impede  the  achievable  data  rate  or  require 
excess  bandwidth,  highly  efficient  usage  of  the  available  transmitted  signal  is  a 
paramount  design  criterion.  Long  constraint  length  (LCL)  Viterbi  decoders  are  already 
used  in  such  applications  as  deep  space  communications  and  voice  recognition,  however 
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despite  the  coding  gain  advantage,  they  are  slow  to  find  widespread  commercial 
acceptance.  LCL  convolutional  coding  with  Viterbi  decoding  can  significantly  improve 
the  reception  of  digital  data.  For  an  uplink  implementation,  a  handheld  communication 
device  would  encode  transmitted  data  with  a  simple  convolutional  encoder  and  the 
basestation  would  receive  and  decode  with  the  rather  more  complex  Viterbi  decoder.  The 
cost  of  a  more  complicated  decoder  present  in  the  smaller  population  of  basestations  is 
expected  to  be  lower  than  other  methods  of  increasing  the  received  SNR  such  as  larger 
basestation  antennae.  Furthermore,  as  IC  technology  shows  a  trend  to  increasing 
transistor  capacity  and  lowered  costs,  this  trade-off  will  continue  to  favor  LCL  Viterbi 
decoding  implementations. 


Figure  18.  System  block  diagram. 

A  typical  system,  as  shown  in  18,  consists  of  a  data  source,  channel  encoder,  modulator, 
noisy  channel,  demodulator,  channel  decoder  and  receiving  system.  The  channel  encoder 
and  decoder  are  used  to  correct  errors  in  the  information  stream  caused  by  channel  noise. 
This  is  known  as  Forward  Error  Correction  (FEC).  For  a  given  SNR,  the  coding  gain  is 
defined  as  the  power  level  increase  (in  dB)  required  by  an  uncoded  channel  to  match  the 
error  performance  of  a  coded  channel.  Hence  coding  gain  is  a  measure  of  the 
effectiveness  of  a  particular  code  with  larger  values  an  indication  of  better  performing 
codes.  Table  9  shows  the  asymptotic  coding  gain  of  a  rate  1/2  convolutional  code 
assuming  BPSK  with  coherent  detection.  Longer  constraint  length  codes  obviously 
perform  better,  but  require  substantially  more  computation. 


Table  9.  Coding  gain  of  several  good,  rate  1/2  convolutional  codes 


Constraint  Length 

Asymptotic  Coding  Gain  (dB) 

3 

.97 

7 

3.98 

11 

5.44 

15 

6.53 

Mapping  and  partitioning  are  an  active  area  of  research  in  VLSI  systems.  As  presented  in 
[7]  [8]  [9],  Index  Mapping  is  a  technique  to  efficiently  map  a  widely  used  class  of 
algorithms  onto  a  space/time  paradigm  with  immediate  representation  as  the  partitioning 
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and  scheduling  map  for  the  algorithm  onto  a  small,  I/O  efficient,  VLSI  array.  This  class 
includes  the  Viterbi  Algorithm  (VA),  the  Fast  Fourier  Transform  (FFT),  and  the  Bitonic 
Sort,  among  others.  A  Fixed  Shuffle  Viterbi  Decoder  using  this  technique  can  be  targeted 
for  the  ReConfigurable  Processor. 

Reed-Solomon  Implementation  in  RCP 

The  code  created  by  Irving  Reed  and  Gustav  Solomon  in  I960  provides  a  powerful 
method  for  forward  error  correction  of  burst  errors.  In  addition,  it  has  an  extremely  low 
undetected  error  rate.  This  means  that  the  decoder  can  reliably  indicate  whether  it  can 
make  the  proper  corrections.  Reed-Solomon  codes  have  been  widely  used  for  error 
detection  and  correction  for  storage  media  such  as  compact  disks  and  magnetic  hard 
drives.  Other  important  uses  include  video  data  error  correction.  Unlike  convolutional 
codes,  it  is  a  block  code  and  corrects  any  number  of  errors  in  a  transmitted  symbol. 
Therefore  they  have  also  been  implemented  as  an  “outer  code”  surrounding  a  Viterbi 
convolutional  error  correction  design  to  improve  decoder  bit  error  rate  performance. 

The  Voyager  spacecraft  and  Hubble  Space  Telescope  are  two  examples  of  space 
communications  systems  which  have  also  used  Reed-Solomon  error  correction.  For 
satellite  communications,  Reed-Solomon  codes  are  fi'equently  incorporated  for  forward 
error  correction.  In  fact,  the  Consultative  Committee  for  Space  Data  Systems  (CCSDS) 
has  standardized  a  Reed-Solomon  format  (223  information  symbols,  16  symbol 
correction  capability  per  codeword)  which  is  employed  by  several  satellite  systems  and 
the  National  Aeronautical  and  Space  Administration  (NASA).  It  is  the  format  suitable  for 
the  incorporation  of  the  Reed-Solomon  algorithm  into  the  proposed  RCP. 

The  Reed-Solomon  decoding  process  is  a  five-stage  process  [10]  as  shown  in  Figure  19 

1.  Calculate  the  syndrome  components  from  the  received  word.  A  syndrome  is  created 
from  a  cyclic  shifting  of  the  most  probable  burst  pattern. 

2.  Calculate  the  error-locator  word  (polynomial)  from  the  syndrome  components. 

3.  Calculate  the  error  locations  from  the  error-locator  numbers  which  are  from  the  error- 
locator  word. 

4.  Calculate  the  error  values  from  the  syndrome  components  and  the  error-locator 
numbers. 

5.  Calculate  the  decoded  code  word  from  the  received  word,  the  error  locations,  and  the 
error  values. 
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Figure  19.  Reed-Solomon  decoder  block  diagram. 

Originally  the  Reed-Solomon  decoding  algorithm  was  performed  in  software  due  to  the 
computational  power  required.  However,  wifh  the  advances  in  VLSI  technology,  and  the 
development  of  several  architectures  which  reduced  the  design  complexity,  the  decoder 
has  also  been  realized  in  integrated  circuits.  As  illustrated  in  the  table  10  [11],  an  FPGA 
implementation  provides  a  significant  speed  improvement  over  a  software  realization.  In 
addition,  the  FPGA  approach  provides  a  better  emulation  of  a  custom  integrated  circuit 
design.  An  FPGA  implementation  can  provide  the  flexibility  to  prototype  various 
formats  prior  to  committing  to  a  final  architecture. 


Table  10.  A  comparison  of  architectural  implementations. 


Implementation 

Throughput 

Microprocessor  (Hypersparc) 

46.2  kB/s 

.035 

DSP  Processor  (TMS320C30) 

18.75  kB/s 

.09 

FPGA 

lOMB/s 

20.33 

ASIC  (LSI  Logic) 

40  MB/s 

406 

Turbo  Coding  for  Advanced  Communications 

Turbo  Codes  were  introduced  in  1993  [12]  and  are  considered  among  the  most  important 
developments  in  coding  theory.  Researchers  around  the  world  have  been  able  to  extend 
the  basic  idea  to  other  forms  of  code  concatenations,  with  various  applications  to 
transmission  over  fading  channels,  band-limited  satellite  channels,  and  channels  with 
intersymbol  interference.  A  turbo  code  is  formed  by  two  simple  convolutional  codes 
separated  by  an  interleaver.  The  decoder  consists  of  two  Soft-Input  Soft-Output  (SISO) 
modules  cormected  by  an  interleaver  and  a  deinterleaver. 

A  Turbo  encoder  is  a  combination  of  two  simple  encoders  [13].  The  input  is  a  block  of  K 
information  bits.  The  two  encoders  generate  parity  symbols  from  two  simple  recursive 
convolutional  codes,  each  with  a  small  number  of  states.  The  information  bits  are  also 
sent  uncoded.  The  key  innovation  of  turbo  codes  is  an  interleaver  P,  which  permutes  the 
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original  K  information  bits  before  inputting  them  into  the  second  encoder.  The 
permutation  P  causes  the  input  sequences  for  an  encoder  that  produces  low-weight 
codewords  to  be  modified  such  that  the  other  encoder  produces  high-weight  codewords. 
Thus,  even  though  the  constituent  codes  are  individually  weak,  the  combination  can  be 
surprisingly  powerful.  The  resulting  code  has  features  similar  to  a  “random”  block  code 
with  K  information  bits. 


TURBO  ENCODER  2  TURBO  DECODER 


Figure  20.  System  diagram  for  a  Turbo  codec. 

Random  block  codes  are  known  to  achieve  Shannon-limit  performance  as  K  gets  large, 
but  at  a  price  of  a  prohibitively  complex  decoding  algorithm.  Turbo  codes  mimic  the 
good  performance  of  random  codes  (for  large  K)  using  an  iterative  deeoding  algorithm 
based  on  simple  decoders  individually  matched  to  the  simple  constituent  codes.  Each 
constituent  decoder  sends  a  posteriori  likelihood  estimates  of  the  decoded  bits  to  the 
other  decoder,  and  uses  the  corresponding  estimates  from  the  other  decoder  as  a  priori 
likelihoods.  The  uncoded  information  bits  (corrupted  by  the  noisy  channel)  are  available 
to  each  decoder  to  initialize  the  a  priori  likelihoods.  The  decoders  use  the  “MAP” 
(Maximum  a  Posteriori)  bit-wise  decoding  algorithm,  which  requires  the  same  number  of 
states  as  the  well-known  Viterbi  algorithm.  The  Turbo  decoder  iterates  between  the 
outputs  of  the  two  constituent  decoders  until  reaching  satisfactory  convergence.  The  final 
output  is,  for  this  particular  algorithm,  a  hard-quantized  version  of  the  likelihood 
estimates  of  either  of  the  decoders. 

In  addition  to  providing  improved  performance,  1.0  dB  compared  to  NASA’s  Cassini 
concatenated  code,  turbo  decoders  are  lower  in  complexity.  Turbo  codes  constructed  from 
two  16-state  constituent  codes  can  achieve  this  signal-to-noise  ratio  (SNR)  performance 
for  10'^  bit  error  rates  (BER)  independent  of  the  encoding  rate.  Turbo  decoding  time  is 
proportional  to  the  number  of  states  and  the  number  of  iterations,  unless  special-purpose 
hardware  such  as  the  RCP  is  used  to  parallel-process  some  or  all  of  the  states.  The 
number  of  states  for  turbo  decoders  is  orders  of  magnitude  smaller  than  the  Cassini 
decoder,  and  this  more  than  offsets  the  modest  additional  amount  of  iterations  required. 
Interleaver  size  impacts  buffer  requirements  and  decoding  delay,  but  not  decoding  time.  It 
is  the  primary  determinant  of  Turbo  code  performance.  Turbo  code  performance  increases 
as  the  interleaver  size  grows. 
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For  latency  critical  applications  such  as  PCS  for  two-way  vocal  communication,  the 
Turbo  codes  presented  by  Berrou,  Glavieux,  and  Thitimajshima  [12]  suffer  from  a  large 
interleaver.  Thus  the  Turbo  decoder  is  inherently  limited  (independent  of  computation 
resources)  because  the  de-interleaver  must  wait  through  a  relatively  large  number  of 
symbols  to  begin  decoding.  More  recently,  the  work  of  Divsalar  and  Pollara  [14],  have 
derived  Turbo  codes  with  greatly  reduced  interleaver  sizes  at  a  modest  performance 
penalty.  These  are  far  more  suitable  to  low  latency  applications  and  represent  a  good 
application  for  the  RCP  in  basestations  for  Turbo  Decoding  in  PCS. 

To  implement  Turbo  Coding  on  the  proposed  RCP,  several  architectural  observations 
need  to  be  made.  The  encoding  and  decoding  features  are  distinct.  This  distinction 
necessitates  different  implementations.  For  real-time  two  way  communication  based  upon 
Turbo  Coding  with  minimum  footprint  requirements,  the  rapid  reconfiguration  feature  of 
the  RCP  can  be  utilized  to  switch  between  implementations.  Of  the  two  implementations, 
the  decoder  has  the  most  complexity;  so  it  will  be  discussed  here.  A  Turbo  decoder  has  at 
its  core  several  small  Maximum  Likelihood  Sequence  Estimators  (Viterbi  decoders)  that 
are  cross  linked  through  de- interleavers  and  run  iteratively  on  the  received  data  sequence. 
Thus,  the  most  obvious  implementation  for  the  RCP  is  to  allocate  a  single  small  Viterbi 
decoder  with  soft  decisions  into  each  PE  element.  Then,  the  dual  port  memories  serve  to 
buffer  and  de-interleave  the  PEs  from  each  other.  The  post-  and  pre-processors  are 
responsible  in  this  situation  for  all  other  activity  like  control  and  data  reconstruction.  A 
more  advanced  implementation  would  be  to  rely  upon  the  rapid  reconfiguration  capability 
of  the  RCP  to  switch  between  parallel  Viterbi  decoding  and  serial  de-interleaving  features 
of  Turbo  decoders. 

Turbo  Coding  offers  significant  SNR  improvements  that  are  particularly  valuable  in 
satellite  telemetry  systems.  The  RCP  board  is  intended  for  satellite  telemetry  systems  that 
are  readily  configurable  to  a  wide  variety  of  communication  protocols  and  signal 
processing  requirements,  an  example  application  being  digital  beamforming  and  blind 
signal  separation. 

3.3.3  Digital  Beamforming 

The  recent  explosion  in  cellular  and  satellite  communications  has  propelled  a  concurrent 
effort  in  the  bandwidth  expansion  of  these  technologies.  An  important  technique  in 
increasing  data  throughput  in  these  systems  is  the  use  of  digital  beamforming. 
Beamforming  aims  the  input  beam  in  a  particular  direction  and  attenuates  incoming 
beams  from  other  directions.  Beamforming  avoids  the  mechanical  wear  and  tear  that  a 
rotating  antenna  undergoes  yet  still  provides  the  improved  signal  reception  due  to 
mitigation  of  unwanted  signals  from  other  directions.  The  fact  that  digital  beamforming  is 
used  avoids  the  drift  and  calibration  difficulties  that  accompany  the  use  of  analog 
processing.  Unfortunately,  the  possible  beam  widths  are  often  quite  large  and  may  admit 
signals  from  several  sources  beyond  those  of  interest.  In  this  case,  further  blind  signal 
processing  may  separate  out  the  users  from  one  another  and  allow  isolation  and  retrieval 
of  the  signals  of  interest  [15].  These  methods  used  in  conjunction  will  improve  the 
bandwidth  available  for  cellular  and  satellite  communications  making  possible  exciting 
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new  advances  in  areas  such  as  vehicular  technology.  Systems  such  as  On-Star  which 
provide  an  array  of  services  from  ambulance/police  alert  to  navigation  assistance  rely  on 
an  increased  bandwidth.  The  expansion  of  these  services  to  web  browsing  and  video  on 
demand  will  compel  even  greater  bandwidth  which  beamforming  and  blind  signal 
separation  will  help  to  provide. 

Array  processing  is  the  processing  of  signals  from  an  array  of  sensors.  These  sensors  may 
be  an3dhing  from  microphones  to  superconducting  quantum  interference  devices.  In 
satellite  and  cellular  communications,  we  usually  consider  an  array  of  antennas.  Figure  21 
shows  the  antenna  array  used  to  do  digital  beamforming  in  the  Kyoto  experiment  [16], 
while  Figure  22  shows  the  array  of  beams  formed. 


Figure  21.  A  4  x  4  antenna  array  for  electronic  beam  steering. 

The  objective  of  beamforming  is  to  delay  (although  attenuation  may  also  be  used)  the 
signal  in  each  antenna  so  as  to  constructively  sum  the  signals  from  a  direction  of  interest 
while  destructively  summing  the  signals  from  other  directions  (where  the  satellite  or  user 
of  interest  is  in  the  direction  of  interest).  In  the  antenna  array  space,  this  amounts  to 
taking  the  inner  product  of  the  input  signal  with  the  antenna  steering  vector  in  the 
direction  of  interest  on  the  antenna  array  manifold.  For  the  antenna  array  in  the  Kyoto 
experiment,  a  2-d  FFT  was  taken  of  the  incoming  signal  and  the  resulting  signal  was 
filtered  to  keep  the  signal  of  interest  and  remove  unwanted  signals.  The  digital  processing 
stage  of  the  system  is  shown  in  Figure  23. 
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Figure  22.  The  array  of  beams  formed  by  the  4x4  antenna  array. 

The  core  technologies  used  in  this  particular  experiment  have  been  around  for  some  time. 
However,  how  the  authors  applied  the  technologies,  especially  the  electronic  beam 
steering  to  enhance  signal  reception,  is  a  novel  matter  for  the  very  important  forthcoming 
generation  of  SATCOM  based  communication  services.  Electronic  beamforming  and 
beam  steering  in  the  past  has  been  primarily  in  applications  where  array  was  stationary. 
With  the  advent  of  commercial  digital  SATCOM  services  (DBS,  the  Internet,  etc.), 
enhancing  reception  is  paramount,  and  vehicular  technology  is  a  huge  potential  market 
for  these  new  SATCOM  services.  Current  wireless  technology  is  quite  unreliable  in 
vehicular  situations  and  has,  as  a  result,  produced  a  rapidly  growing  market  that  demands 
higher  levels  of  performance  and  reliability.  The  market  potentials  are  significant. 

An  antenna  array,  as  shown  in  Figure  24,  was  mounted  on  the  roof  of  an  RV  and  driven 
throughout  Kyoto  to  test  the  effectiveness  of  the  beamformer  in  practical  situations.  In 
particular,  the  beamformer  was  tested  under  partial  and  heavy  shadowing  conditions.  The 
system  was  able  to  track  the  satellite  and  reacquire  the  signal  quickly  under  both  light  and 
heavy  shadowing.  Figure  25  shows  the  performance  of  the  system  under  shadowing 
conditions,  i.e.,  as  it  passes  underneath  a  bridge.  The  signal  level  for  the  array  and  for  a 
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Figure  24.  The  RV  equipped  with  the  roof  mounted  DBF  antenna  array. 

single  antenna  are  also  shown.  The  antenna  array  provides  approximately  lOdB 
improvement  in  signal  power  over  a  single  antenna,  so  the  array  provides  significant 
improvement  in  SNR  over  a  single  antenna. 
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Shadowing  by  an  overpass 


Figure  25.  Performance  of  the  antenna  array  versus  a  single  element  under 

shadowing. 


The  system  was  implemented  using  10  FPGAs  on  a  printed  circuit  board  (8  implement 
coherent  detection  and  2-d  FFT,  and  2  implement  beam  selection,  control,  and  interface). 
The  FPGA  board  is  shown  in  Figure  26 


Figure  26.  A  DSP  board  with  ten  Xilinx  FPGAs. 


Each  FPGA  chip  in  the  above  system  had  approximately  13,000  gates,  so  that  the  entire 
system  consisted  of  130,000  gates.  However,  current  FPGA  chips  have  a  much  higher 
gate  density,  and  the  previous  system  could  be  fit  on  a  single  FPGA  chip.  Therefore,  if 
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the  above  system  were  implemented  in  state  of  the  art  technology  such  as  in  the  RCP,  the 
capability  for  implementing  more  advanced  algorithms  is  realizable  due  to  architectural 
enhancements. 

The  beamforming  method  has  some  limitations.  As  shown  by  the  beam  pattern  in  Figure 
22,  the  resolution  of  the  beams  is  quite  large,  making  beamforming  of  limited  use  when 
sources  are  close  together  relative  to  their  distance  from  the  antenna  array.  In  this 
situation,  we  propose  further  signal  processing  to  pull  the  desired  signals  out  from  the 
composite  signal.  In  particular,  we  propose  to  separate  the  signals  using  blind  signal 
separation  [15].  In  this  technique  the  signals  from  multiple  sensors,  each  of  which 
receives  some  combination  of  the  signal  of  interest  and  the  interfering  signal,  are 
unmixed  using  very  general  properties  of  the  signal.  The  attractive  feature  of  this 
technique  is  that  the  necessary  property  holds  for  most  signals.  This  requires  signals  from 
several  beams  of  the  beamformer  be  used.  This  technique  has  been  shown  to  separate  up 
to  10  speakers  in  an  audio  system  where  the  sensors  consist  of  an  array  of  microphones 
[17].  This  technique  could  potentially  reduce  an  interfering  signal  down  to  a  Gaussian 
noise  source  in  the  system,  i.e.,  interference  attenuations  on  the  order  of  10  dB  or  more 
would  be  possible. 


3.3.4  Flight  Telemetry  Processing 

Telemetry  is  the  process  of  transmitting  data  measured  at  a  remote  location  to  a  distant 
receiving  station  for  recording  and  analysis.  Real-time  recording  and  analysis  involves 
very  high  rates  of  signal  processing.  The  proposed  RCP  would  provide  the  telemetry 
community  with  a  low-cost,  high-performance  integrated  configware  system  for  signal 
processing  that  fits  right  into  a  PCI  Mezzanine  board  slot  in  a  Personal  Computer  (PC)  as 
an  alternative  to  complex,  expensive,  and  bulky  telemetry  systems.  As  a  innovative 
flexible  solution,  the  RCP  can  be  reconfigured  at  will  to  do  whatever  telemetry  task  or 
control  function  is  desired  [18].  So  the  RCP  can  be  used  for  high-speed  data  I/O 
processing  and  storage,  including  frame  synchronization,  decommutation,  telemetry 
simulation,  FEC  decoding  and  encoding,  randomization,  reception,  bit  and  frame 
synchronization,  time  code  processing,  NASCOM  blocking  and  deblocking,  NASCOM- 
IP  conversion,  and  high-speed  disk  logging.  Since  the  RCP  is  independent  of  the 
evolving  standards  in  satellite  telemetry  used  in  global  communications,  the  RCP  would 
be  able  to  handle  different  standards  such  as  TDM,  CCSDS  and  NASCOM.  To  meet 
performance  needs,  the  RCP  enables  high  data-rates  to  be  obtained  without  burdening  the 
host  CPU  as  a  co-processor  and  hardware  accelerator. 

The  RCP  could  also  be  a  major  complement  to  the  current  capabilities  of  typical  Visual 
Telemetry  Systems,  such  as  the  NeTstar  system  from  L3  Communications,  shown  in 
Figure  27.  The  Loral  550  system  could  also  benefit  from  the  RCP  architecture  provided 
that  the  RCP  is  mapped  to  a  VME  6U  card.  The  RCP 
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Figure  27.  NeTstar  block  diagram. 

will  provide  the  processing  solutions  needed  for  advanced  telemetry  as  a  hardware 
accelerator.  The  only  functions  that  it  is  not  capable  of  replacing  are  the  RF  sections.  The 
RCP  can  provide  a  unique  answer  to  telemetry  needs.  As  a  configware  solution,  the  RCP 
will  provide  significant  cost  savings  for  the  level  of  performance  that  it  offers.  Most 
telemetry  solutions  are  software  solutions.  The  RCP  is  a  hardware  solution.  Most 
hardware  solutions  are  fixed  to  a  certain  handshaking  protocol.  The  RCP  can  be 
reconfigured  to  meet  the  distinct  protocol  requirements  that  exist  between  different 
satellite  systems.  The  programmable  nature  of  the  RCP  gives  it  a  natural  ability  to  input 
serial  and  parallel  telemetry  streams  and  route  them  in  various  ways,  making  it  an  ideal 
solution  for  front-end  telemetry  and  command  systems.  Also,  the  core  hardware  on  the 
RCP  board  is  not  dependent  on  any  specific  system  requirements.  For  example,  the  RCP 
could  alter  itself  to  handle  both  Ka-Band  and  L-Band  video  data  processing  requirements. 
So  separate  specialized  processors  are  no  longer  necessary  since  the  RCP  can  mimic  their 
functions.  Consequently,  the  RCP  is  flexible  enough  to  minimize  the  need  for  expensive 
hardware  redesign. 


3.3.5  Dynamic  ‘‘On-the-Fly”  Reconfiguration 

The  RCP  can  be  used  as  a  critical  component  in  systems  that  require  or  desire  real-time 
processing  through  rapid  reconfiguration.  Real-time  processing  relies  upon  computational 
algorithms  whose  correctness  depends  not  only  on  the  logical  results  of  computation,  but 
also  on  the  time  at  which  the  results  are  produced.  For  example,  an  image  processing 
application  for  object  thinning  may  demand  separate  pre-flltering  and  thresholding  steps 
before  [19]  or  even  while  running  the  thinning  operation,  thus  requiring  rapid 
reconfiguration.  Implementations  of  such  algorithms  can  be  enhanced  by  using  the  RCP 
architecture.  The  RCP  will  enable  the  user  to  achieve  the  utmost  reliability  due  to  its 
flexible  architecture  and  performance  through  its  rapid  reconfiguration  features.  The  RCP 
will  utilize  state  of  the  art  enabling  technology  and  design  methodology  [20]  to  achieve 
these  results.  Configware  in  general  has  been  shown  to  be  very  suitable  for  applications 
that  demand  fault  tolerance.  These  applications  include  advanced  algorithms  that  are  used 
in  networking,  communications  and  control  systems,  where  rapid  reconfiguration 
provides  the  best  means  of  adapting  when  faced  with  “exogenous  shocks” — a,  term  for 
unforeseen  events  like  the  loss  of  a  network  router  or  a  malfunctioning  machine  in  an 
industrial  assembly  line — which  can  significantly  degrade  system  reliability  and 
performance.  The  RCP  design  has  reconfiguration  data-flow  dependencies  in  mind  and 
incorporates  FIFO  structures  for  tight  latency  requirements,  localized  dual  port  memory 
for  fast  computational  addressing  and  deep  RAM  for  quick  data  storage  and  retrieval  of 
partial  products  used  in  high  bandwidth  computations.  An  advanced  architectural  scheme 
was  created  to  handle  the  interconnect  between  different  elements  within  the  RCP 
architecture.  The  memory  available  on  the  RCP  can  also  be  utilized  to  store  different 
configurations  for  the  FPGA  PE  PLDs  and  ICube  IQX160s  that  populate  the  RCP  core 
and  thereby  allows  the  core  to  not  completely  be  dependent  on  the  PCI  bus.  These 
features,  and  a  PCI  interface,  will  allow  the  RCP  to  outperform  other  methods  and  other 
reconfigurable  processors  [21]  to  do  real-time  signal  processing. 

Evolvable  Hardware 

Engineering  can  be  described  as  finding  and  providing  solutions  to  optimization  problems 
within  a  set  of  design  constraints.  Evolvable  algorithms  are  tomorrow’s  most  robust 
solutions  to  optimization  problems  and  represent  a  paradigm  shift  fi*om  today’s  adaptive 
algorithms.  The  RCP  architecture  with  its  partial  reconfiguration  capability  is  a 
appropriate  platform  for  exploring  evolvable  hardware  implementations.  Evolvable 
algorithms  differ  from  adaptive  algorithms  in  that  they  reorganize  their  operation  without 
a  priori  knowledge  of  system  constraints.  The  constraints  are  nominally  derived  based 
upon  the  usage  of  heuristics  or  rule-bases  like  adaptive  algorithms.  However,  they  do  not 
remain  fixed.  The  constraints  can  change  according  to  the  tenets  of  evolution  and  chaos 
theory.  They  often  are  the  most  critical  element  within  the  architectural  implementation 
of  the  algorithm,  irrespective  of  whether  the  algorithm  is  adaptive  or  evolutionary. 
Adaptive  architectures  have  a  fixed  silicon  structure  and  normally  have  variable 
coefficient  parameters  that  are  stored  either  in  a  lookup  table  or  in  memory.  Their 
constraints  are  often  predetermined  error  terms.  An  example  of  this  is  the  weights  in  the 
least  mean  squares  (LMS)  algorithm.  Evolvable  architectures  on  the  other  hand  alter  both 
their  basic  silicon  structure  and  cell  connectivity  to  operate  within  a  given  set  of 
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constraints.  These  evolvable  constraints  may  vary  in  complexity,  implying  that  they 
cannot  be  represented  for  all  known  a  priori  conditions.  These  conditions  make  up  the 
environment  in  which  the  circuit  operates.  As  the  conditions  change  and  evolve,  the 
architecture  changes  to  minimize  error  terms  to  optimize  operations  to  constraints 
inferred  within  the  context  of  the  new  environment.  Evolvable  architectures  are 
intelligent  in  that  they  encompass  a  process  of  natural  selection  that  makes  them  able  to 
find  an  optimal  solution  to  any  given  situation.  With  situational  awareness,  they  can 
mutate  to  retain  an  optimal  solution  in  real-time  systems. 

Evolvable  hardware  has  numerous  aspects  that  make  it  advantageous  over  other  possible 
methodologies.  Architectural  changes  can  be  made  to  optimize  tradeoffs  between  desired 
performance,  area  and  other  constraints  based  upon  actual  conditions.  For  instance,  a 
typical  adaptive  low  pass  FIR  filter  may  begin  by  modifying  coefficients  to  achieve  a 
certain  degree  of  optimization  in  terms  of  how  sharp  its  transition  bandwidth  is.  But  this 
may  not  be  sharp  enough,  so  the  filter  can  then  begin  to  evolve  until  it  has  twice  as  many 
taps.  This  might  give  a  certain  transition  band  within  a  relatively  short  period  of  time.  But 
suppose  the  system  desires  an  even  sharper  cutoff  to  attenuate  a  noisy  channel.  The 
system  could  then  evolve  into  a  HR  filter  to  gain  an  even  tighter  transition  bandwidth  at 
the  expense  of  a  slight  slowdown  in  speed. 

Evolvable  hardware  is  not  too  different  from  today’s  efficient  VLSI  design  practices. 
Parallel  architecture  features  allow  for  quick  operation  and  cellular  architecture  features 
that  are  representative  of  the  divide-and-conquer  approach  enable  incremental  structural 
changes  that  make  partial  context  switching  possible.  A  key  question  is  what  applications 
require  operation  in  unknown  a  priori  constraints?  There  are  numerous  applications, 
some  of  which  we  will  list  here.  A  primary  set  of  applications  include  those  that  utilize 
genetic  algorithms.  One  possible  application  is  in  environments  where  there  are  fi'equent 
temperature  extremes,  such  as  plasma  antenna  array  satellites  operating  in  space.  Genetic 
algorithms  implemented  in  evolvable  hardware  would  enable  a  expensive  satellite  to 
extend  their  lifespan  as  well  as  their  effectiveness  perhaps  by  altering  itself  in  real-time 
based  upon  solar  wind  patterns  and  system  usage  patterns.  Another  application  is  in 
systems  that  involve  complex  network  configurations  such  as  those  found  in  the 
telephone  and  computer  industries.  Here  a  genetic  algorithm  implementation  of  simulated 
annealing  might  be  utilized  to  provide  an  optimal  solution  to  the  traveling  salesman 
problem  for  finding  the  minimal  routing  delay  between  two  or  more  points.  And 
predictive  analysis  for  data  mining  within  the  stock  market  could  be  a  rather  lucrative 
application.  Rapid  changes  in  stock  market  indicators  used  in  derivatives  can  be  analyzed 
ahead  of  the  curve  to  maximize  net  profit  gains  with  evolvable  hardware.  Since  advanced 
signal  conditioning  based  upon  vector  fields  could  be  accelerated  to  deal  with  matrices 
without  full  rank  or  matrices  that  have  full  rank  but  have  eigenvalues  that  are  ill 
conditioned.  Other  areas  of  interest  include  unmanned  vehicles  and  diagnostic 
applications  in  medicine.  In  short,  with  evolvable  hardware,  real-time  asset 
reconfiguration  enables  next  generation  solutions  by  enabling  smart  machine  designs. 
Perhaps  someday  a  doctor  could  apply  a  skin  patch  on  a  sick  patient  to  monitor  the 
insulin  and  oxygen  levels  found  in  their  blood  and  have  it  later  evolve  to  monitor 
cholesterol  as  the  patient  begins  to  recover  using  the  same  sensor. 
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A  Future  FPGA  Technology  Niche 

FPGA  technology  brings  some  unique  advantages  that  are  absent  from  traditional  VLSI 
architectures.  The  key  advantage  offered  is  the  ability  to  physically  change  through 
reconfiguration.  This  allows  for  generic  processing  that  used  to  only  be  possible  in 
software  to  be  sped  up  by  mapping  it  directly  into  hardware.  This  hybrid  technology 
combines  the  versatility  of  a  standard  CPU  with  the  performance  of  a  standard  cell 
custom  ASIC.  The  other  key  advantage  is  the  shrinking  form  factor  size  requirements  that 
reconfiguration  enables.  The  same  piece  of  hardware  could  be  re-utilized  to  perform  a 
completely  different  function.  For  example,  a  digital  handheld  real-time  video 
telecommunications  application  is  a  prime  target  as  a  future  application  since  multiple 
functions  can  be  realized  using  the  same  hardware.  Such  an  application  might  rely  on  a 
few  evolvable  system  configurations  to  adjust  to  bandwidth  and  power  tradeoffs.  Of 
course  requiring  the  end-user  to  have  the  technical  background  needed  to  create  their  own 
configuration  files  using  a  hardware  description  language  would  unduly  restrict  market 
opportunities.  So  it  is  envisioned  that  users  could  download  new  configurations  for  a 
business  communications  supplier  through  a  modem  configuration,  allowing  the  device  to 
be  truly  multi-functional  and  enabling  a  large  potential  consumer  base  to  exist.  Future 
internal  device  characteristics  of  FPGAs  are  expected  to  radically  change  as  well.  So  the 
current  existence  of  SRAM  based  FPGA  technology  should  not  be  thought  of  as  a 
prelude  to  the  demise  of  integrated  circuit  digital  design.  Ultra  high  speed  interconnects 
and  resulting  wire  crosstalk  still  remain  a  major  research  challenge  that  needs  to  be 
explored  and  conquered  on  the  device  and  systems  level.  Novel  architectures  are 
constantly  being  introduced  and  many  discrete  components  have  not  yet  been  completely 
integrated.  For  example,  new  Flash  technology  FPGAs  that  have  half  the  relative 
capacitance  and  seven  times  the  density  of  their  SRAM  counterparts  with  fast  and 
frequent  access  time  requirements  are  currently  under  development.  Other  future 
developments  will  probably  include,  but  not  be  limited  to,  a  merging  of  sensor  arrays 
with  configware  for  situation  awareness.  FPGA  technology  is  the  only  available 
technology  that  has  the  flexibility  and  performance  to  make  evolvable  hardware  feasible. 
As  its  name  implies,  evolvable  hardware  may  someday  exist  with  the  capability  of 
recreating  the  complex  genetic  DNA  interactions  found  only  in  biologically  living 
creatures. 

Today,  reconfigurable  hardware  solutions  are  expanding  to  wide  variety  of  engineering 
applications  by  leveraging  off  the  increasing  density  available  in  FPGA  technology  and 
the  flexibility  of  ASIC  Field  Programmable  Interconnects  (FPIC).  Having  hardware 
commonality  between  subsystems  reduces  costly  development  efforts  by  re-using  the 
same  hardware  for  different  applications.  Reconfigurability  further  accelerates 
productivity,  providing  rapid  conversion  from  concept  to  reality  by  allowing  full  system 
implementations  without  additional  hardware  development  costs.  The  motivation  for  the 
RCP  research  was  to  investigate  architectures  for  the  next  generation  of  cost  effective, 
high  performance  BMDO  applications.  Given  the  tremendous  diversified  growth  in 
telemetry,  remote  sensing,  telephony,  and  telecomputing,  robust  processing  of  multi¬ 
source  acquired  data  signals  for  digital  communication  through  reconfiguration  becomes 
a  paramount  system  concern.  Further  work  in  reconfigurable  parallel  and  distributed 
signal  processing  architectures  will  be  investigated  in  Phase  II  to  enable  new  and 
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improved  algorithms  to  be  mapped  into  the  ReConfigurable  Processor  (RCP).  The  RCP 
will  be  designed,  developed,  and  fabricated  during  the  2nd  phase  of  this  ReConfigurable 
Processing  research  project.  Such  configware  solutions  are  needed  to  handle  the  diverse 
bandwidths,  high  data  rates,  varying  signal  qualities,  and  variable  signal  processing  needs 
in  current  and  forthcoming  BMDO  systems 

3.4  Task  4:  Correlation  Matrix  Estimate  Comparison  in  DOA  Finding 
3.4.1  Introduction 

In  order  to  study  the  effect  of  error  in  the  correlation  matrix  estimate  in  a  specific 
example,  we  consider  direction  of  arrival  (DOA)  estimation  with  the  correlation 
estimators  described  above.  We  compared  the  correlation  estimators  in  estimating  the 
DOA  of  a  sinusoidal  signal  on  an  antenna  array. 

In  the  conventional,  complex  hybrid  sign  and  the  quadruplex  correlation  estimates,  the 
variances  have  been  determined  theoretically.  The  conventional  variance  is; 

vai{Ry)  =  cr*  I N 

The  variance  for  the  quadruplex  estimate  is  given  by; 

VZT(Ry)  =  (^-2)^ 

It  has  been  shown  in  [22]  that  both  the  complex  hybrid  sign  and  the  quadruplex 
approximation  exibit  higher  variances  than  the  conventional  method.  Numerically,  these 
varainces  are  between  50  —  100%  and  14%  greater  then  that  of  the  conventional  estimator 
for  hybrid  sign  and  quadruplex  respectively. 

We  applied  the  correlation  estimators  to  the  problem  of  DOA  estimation  using 
Beamforming  [23].  Beamforming  takes  the  inner  product  of  the  output  of  an  antenna 
array  with  a  weight  vector  to  produce  an  output  signal  (on  which  further  temporal  or 
spatial  processing  may  be  done).  This  inner  product  is  typically  used  to  isolate  the  signal 
coming  from  a  particular  direction.  Assuming  there  is  a  signal,  which  is  significantly 
larger  than  the  background  noise  impinging  on  an  array; 

x  =  As +n 

where  x  is  the  received  signal,  A  is  the  antenna  steering  matrix,  s  is  the  transmitted  signal, 
and  n  is  additive  white  Gaussian  noise.  We  can  train  the  weight  vector  to  only  allow  that 
signal  to  pass  through  the  inner  product,  essentially  carrying  out  a  spatial  filter  of  the 
incoming  signal.  This  is  done  by  using  gradient  ascent  to  maximize  the  power  of  the 
inner  product  output; 

maxfPfw))  =  m2ix(E(w'^xx^w))  =  raaxiw"  R^w) 


Furthermore,  by  examining  the  weight  vector  (taking  its  inner  product  with  signals 
impinging  from  all  possible  directions),  we  can  estimate  the  Direction  of  Arrival  (DOA) 
of  the  signal  of  interest.  This  DOA  estimate  depends  critically  on  the  estimate  of  the 
correlation  matrix  R^,^,  so  we  will  study  the  effect  of  the  correlation  estimate  on  the  DOA 
estimate. 
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3.4.2  Computational  Complexity 


The  next  issue  is  computational  complexity  of  the  three  different  correlation  estimators. 
The  complexity  will  be  based  on  multiplications/divisions,  additions/subtractions, 
comparisons,  and  shift  for  negative  signs.  Given  an  L  sensor  element  array  and  a 
sequence  of  N  time  samples,  the  computational  burden  for  (R)  y  the  i*  and  element  in 
the  signal  correlation  matrix  are  in  the  Tables  1 1  &  12  below. 

The  comparison  of  the  real  conventional,  hybrid  sign  and  quadruplex  : 


Table  1 1 :  The  complexity  for  a  single  element  in  the  correlation  matrix 


Multiplies/ 

Total 

Additions 

/Total 

Compares/ 

Totals 

Shift  (negative 
sign)Totals 
at  most 

Real  Data 

conventional 

N 

(N-1) 

0 

0 

hybrid  sign 

1 

2(N-1) 

1 

2 

quadruplex 

1 

2(N-1) 

8 

2 

Complex  Data 

conventional 

4N 

2(N-1) 

0 

0 

Hybrid  sign 

4 

4(N-1) 

2/ 

2 

quadruplex 

4 

4(N-1) 

8/ 

2 

Table  12:  The  complexity  for  calculating  the  entire  correlation  matrix 


Multiplies/ 

Total 

Additions 

/Total 

Compares/ 

Totals 

Shift  (negative 
sign)Totals 
at  most 

Real  Data 

conventional 

NL^ 

(N-1)  V 

0 

0 

hybrid  sign 

L^+2 

2(N-1)  V 

V 

V 

quadruplex 

L^+2 

2(N-l)V 

V 

V 

Complex  Data 

conventional 

4NL" 

2(N-1)  V 

0 

0 

Hybrid  sign 

4L^+2 

4(N-1)  V 

V 

V 

quadruplex 

4L"+2 

4(N-1)  V 

V 

V 

We  also  compare  the  computational  complexity  of  correlation  estimators  for  a  specific 
example,  to  illustrate,  in  real  numbers,  the  potential  computational  efficiency  of  using  the 
alternative  correlation  estimator,  and  especially  the  quadruplex  estimator.  The  results  are 
tabulated  below  in  Table  13. 
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Table  13:  Calculating  the  correlation  estimates  of  the  different  methods 


Total  bit  level 
operations 

Bit  level 
Multiplies 

Bit  level  Additions 

Complex  Data 

conventional 

989,900 

960,000 

29,900 

N  =  300 

hybrid  sign 

122,816 

3,216 

119,600 

quadruplex 

371,000 

11,200 

359,800 

Complex  Data 

conventional 

10,199,800 

9,600,000 

599,800 

N  =  3000 

Hybrid  sign 

1,201916 

3,216 

r  1,199,600 

quadruplex 

3,611,000 

11,200 

3,599,800 

3.4.3  Simulation  Results 

We  simulated  the  beamforming  algorithm  and  the  three  correlation  estimates  in  Matlab. 
A  sinusoidal  signal  was  assumed  to  be  transmitted  from  7t/8  degrees  relative  to  the 
normal  to  the  array.  We  used  a  10  element  linear  antenna  array  uniformly  spaced  at  d=X./2 
where  X,  is  the  wavelength  of  the  incoming  signal.  In  our  case,  we  used  a  2MHz 
bandwidth  signal  sampled  at  2MHz  to  avoid  aliasing.  The  incoming  signal  was  complex 
as  was  the  antenna  array  steering  matrix  and  the  white  Gaussian  noise.  We  assumed  no 
fading  and  additive  white  Gaussian  noise  of  varying  power.  The  weight  training 
procedure  consisted  of  updating  the  weight  vector  50  times  for  each  set  of  time  samples 
(i.e.  for  each  correlation  matrix)  according  to  the  gradient  ascent  equations  mentioned 
above.  This  number  of  updates  was  chosen  large  enough  to  ensure  convergence.  We  did 
not  address  questions  of  convergence  speed  in  this  report,  only  approximation  accuracy. 
Each  set  of  parameters  was  run  for  10  trials  to  determine  mean  and  variance  of  the  DOA 
estimate. 

We  were  interested  in  the  correlation  approximations  as  a  function  of  several  variables. 
We  varied  the  SNR,  number  of  time  samples  per  correlation  matrix  formulation,  and  of 
course,  the  approximation  method  (conventional,  hybrid,  or  quadruplex).  The  variances 
for  the  different  approximation  methods  given  by  Sullivan  [4]  were  predicated  on  the 
input  signal  being  independent  and  identically  distributed  Gaussian  random  variables.  We 
therefore  used  very  low  SNR  to  see  if  our  quadruplex  and  hybrid  variances  matched  up 
with  theory.  However,  in  most  practical  situations,  the  signal  is  larger  than  the  noise.  To 
model  realistic  situations,  we  allowed  much  higher  SNR.  We  were  also  interested  in  the 
tracking  ability  of  these  algorithms,  i.e.  how  well  would  they  perform  if  the  signal  of 
interest  were  moving  (as  happens  in  a  cellular  telephone  environment,  for  instance).  This 
implies  that  samples  must  be  taken  over  short  time  windows,  such  that  the  signal  has 
roughly  stationary  parameters  over  the  set  of  samples.  We  therefore  varied  the  number  of 
samples  per  correlation  matrix  to  see  how  the  correlation  approximation  biases  and 
variances  were  affected  by  number  of  samples. 

The  simulation  results  show  good  correlation  with  theory  at  low  SNR  and  show  very 
interesting  behavior  at  high  SNR.  The  variances  for  the  quadruplex  and  hybrid 
approximations,  derived  assuming  Gaussian  distributed  random  variables,  showed  a  14% 
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increase  from  conventional  to  quadruplex  estimates.  The  biases  of  our  estimates  are 
shown  in  the  plots  in  figure  17.  For  low  SNR  (0,  6,  10),  the  quadruplex  variances  are 
approximately  16%  larger  than  the  conventional  method  variances.  This  agrees  well  with 
the  theoretical  value  of  14%.  This  indicates  that  our  quadruplex  and  conventional 
correlation  calculations  are  working  properly. 


Figure  17:  Bias  in  radians  vs.  SNR  for  hybrid  sign  (dot-dashed  line),  quadruplex 
(dashed  line),  and  conventional  (dotted  line)  correlation  approximations.  N=300 
samples  on  the  right  and  N=3000  samples  on  the  left. 

However,  a  very  interesting  phenomenon  occurs  at  high  SNR  (>10dB).  The  bias  and  the 
variance  of  the  quadruplex  estimator  is  actually  lower  than  that  of  the  conventional 
estimator. 

3.4.4  Discussion 

Given  the  above  performance  results,  we  can  classify  the  suitability  of  the  various 
correlation  approximation  methods  under  different  conditions.  The  hybrid  approximation 
clearly  performs  poorly  across  all  SNR  levels  having  a  bias  of  from  25%  to  400%  larger 
than  the  conventional  approximation.  The  quadruplex  estimator,  on  the  other  hand, 
performs  very  well  compared  to  the  conventional  approximation  across  all  SNR  levels. 
The  most  striking  feature  is  that  the  quadruplex  approximation  has  lower  bias  and 
variance  than  the  conventional  approximation  for  high  SNR.  This  is  unexpected  since  the 
conventional  approximation  produces  a  theoretically  optimal  estimate  of  the  correlation 
matrix.  This  improved  bias  and  variance  warrant  further  investigation. 

These  results  also  impact  calculation  of  the  inner  product  in  neural  networks.  The 
quadruplex  and  hybrid  sign  are  simply  approximations  of  irmer  products  [3].  They  are 
therefore  applicable  to  neural  networks  where  the  irmer  product  is  calculated  between  the 
input  vector  or  output  vector  from  the  previous  layer  and  the  weight  vector  corresponding 
to  each  processing  element.  Our  results  suggest  that  the  quadmplex  approximation  would 
be  most  successful  in  neural  network  application. 
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4  Phase  I  Summary  and  Recommendations  for  Phase  II 

4. 1  Summary  of  Phase  I  Effort 

The  Phase  I  accomplishments  can  be  listed  as  follows: 

•  Correlation  matrix  estimators  implemented  &  verified  on  an  FPGA  processor. 

•  Nonlinear  correlation  estimators  provide  up  to  twice  the  throughput  rate  in  bit  serial 
and  3-5  times  the  throughput  rate  in  bit  parallel  FPGA  implementations  compared  to  a 
conventional  estimator 

•  Nonlinear  Correlation  estimators  require  1/3  to  I/IO"’  the  number  of  bit  level 
operations  that  the  conventional  estimator  requires. 

•  Area  and  time  can  be  traded  off  against  one  another  to  achieve  a  desired  throughput 
rate  given  a  certain  number  of  FPGA  processors. 

•  The  quadruplex  estimator  provides  the  least  error  in  estimate  while  the  hybrid  sign 
and  polarity  coincidence  provide  the  greatest  speed  increase. 

•  A  Multiply  Interconnected  FPGA  ReConfigurable  Processor  Core  Architecture  was 
specified  with  applicability  to  a  variety  of  signal  processing  functions. 


4.2  Phase  II  Recommendations  for  FPGA  RCP  Development  Targeted  to 
BMDO  Programs 

Given  the  different  domains  of  applicability  of  FPGA  processors  and  DSP’s,  we  will 
propose  (in  phase  II)  development  of  a  complete  hybrid  reconfigurable  system  combining 
the  best  of  both  FPGAs  (consisting  of  the  Multiple  Intercoimected  FPGA  based  RCP  core 
proposed  as  a  result  of  the  Phase  I  study)  and  DSP  processors.  We  will  then,  under  close 
consultation  with  BMDO,  implement  an  application  of  interest  to  BMDO  on  the  hybrid 
architecture. 

The  Phase  III  task  will  be  the  field  testing  of  the  hybrid  architecture  on  a  specific  program 
relevant  to  BMDO’s  mission,  along  with  the  commercialization  of  the  proposed  RCP  for 
other  applications. 


4.3  Some  Potential  CME  Applications  of  Particular  Interest  to  BMDO 

We  complete  this  report  by  listing  two  very  important  applications  of  the  proposed  FPGA 

technology: 

•  Adaptive  (Kalman)  filter  tracking  of  Missiles  subject  to  false  target,  decoy,  and  chaff 
countermeasures. 

•  Automatic  Target  Recognition  (ATR)  of  incoming  missile  bogeys  (air-to-air 
interceptor’s  and  ground-to-air  seeker’s  ATR  processing),  especially  MIRVed 
ballistic  missile  warheads. 
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5  Appendices 


5. 1  VHDL  code  for  correlation  estimators 

This  section  lists  the  VHDL  code  written  to  fulfill  the  goals  of  Task  1  of  the  contract. 
This  listing  does  not  include  all  the  code  written  for  the  task,  as  including  all  the  code 
written  would  have  been  prohibitive. 

5.1.1  Polarity  Coincidence  Correlation  Estimator: 

This  is  the  estimator  requiring  the  least  computation  but  offering  poor  accuracy.  The  main 
listing  is  on  the  next  few  pages.  Subroutines  are  given  at  the  end  of  this  section. 
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--  Implement  the  following  function  for  256  samples. 

Rij=(pi/(2*n)sum(sign(X)*sign(Y))  l<=n<=256 
Input  is  byte  serial 


LIBRARY  ieee; 

USE  ieee.stdJogic_l  164.ALL; 

LIBRARY  arithmetic: 

USE  arithmetic. std_logic_arith. ALL; 

Graphics  Design  Library 

ENTITY  poIcoin_s  IS 

PORT  (SIGNAL  xysig  :IN  SIGNED(7  DOWNTO  0); 
read  on  the  rising  edge  of  elk. 

SIGNAL  elk  ;IN  STD.LOGIC; 

SIGNAL  ee  :IN  STD_LOGIC; 

SIGNAL  ares  :IN  STD.LOGIC; 
resets  to  zero. 

SIGNAL  rsignal  :OUT  SIGNED(1 1  DOWNTO  0)); 
comp  output  with  4  binary  places,  ie.  pos  values  look  like; 

with  4  binary  places,  ie.  pos  values  look  like: 

2^-1,  2''-2,  2M,  2M,  2''-5,  2^-6,  2^-1, 2''-8]. 

ARCHITECTURE  behav  OF  polcoin_s  IS 


—Mentor 

—2s  comp  input, 

-clock . 

-clock  enable, 
-asynchronous  reset, 

-2s 

-2s  comp  output 
-  [0,  2'^2,  2,  1  . 


SIGNAL  xinput  :SIGNED(7  DOWNTO  0); 

-  Intermediate  signals  between  the  input  buffer  and  the  quadrx_p 
SIGNAL  yinput  :SIGNED(7  DOWNTO  0); 

"  component. 

SIGNAL  clk2  :STD_LOGIC; 

Divide  by  2  clock,  transitions  occur  on  the  rising  edge  of  elk. 

SIGNAL  clk2inv  :STD_LOGIC; 

Inverse  of  clk2. 

SIGNAL  enclk2  :STD_LOGIC; 

enable  of  the  input  buffer  register  that  holds  the  y  value. 

SIGNAL  enclk2inv  :STD_LOGIC; 

enable  of  the  input  buffer  register  that  holds  the  x  value 
BEGIN 

clk2inv  <=  NOT  clk2; 

enclk2  <=  ce  AND  clk2; 

enclk2inv  <=  ce  AND  clk2inv; 

clockhalf:  ENTITY  work.dff_pea(behav) 

PORT  MAP  (q=>clk2,  d=>clk2inv,  clk=>clk,  ares=>ares  ce=>ce)- 
output  is  clk2  and  clk2inv. 


--  Clock 
-  Clock 


"  Clock  divider. 


regx:  ENTITY  work.rcgs_pea(behav) 
register  that  latches  current  value  of  x. 

GENERIC  MAP  (width  =>  8) 
enter  xysig,  alternating  in  a  byte  serial 

PORT  MAP  (q  =>  xinput,  d  =>  xysig,  elk  =>  elk,  ee  =>  enelk2inv,- 
the  X  value  on  the  rising  edge  of 

ares  =>  ares); 

and  clk2inv  =  1  (clk2  =0). 

regy:  ENTITY  work.regs_pea(behav) 
register  that  latches  current  value  of  y. 

GENERIC  MAP  (width  =>  8) 
enter  xysig,  alternating  in  a  byte  serial 

PORT  MAP  (q  =>  yinput,  d  =>  xysig,  elk  =>  elk,  ce  =>  enclk2,  - 
the  y  value  on  the  rising  edge  o 

ares  =>  ares); 

and  clk2  =  1 . 

qrx:  ENTITY  work.polcoin„p(behav) 
estimator,  clocked  on  the  rising  edge  of  clk2. 

PORT  MAP  (xsignal  =>  xinput,  ysignal  =>  yinput,  elk  =>  clk2,  - 
and  regy  components. 

ce  =>  ce,  ares  =>  ares,  rsignal=>  rsignal); 

END  ARCHITECTURE  behav; 


—  Input  buffer 
—  X  and  y  values 
-  fashion.  Regx  latches 

-  elk  when  ce  =1 

-  Input  buffer 

-  X  and  y  values 
■  fashion.  Regy  latches 

-  elk  when  ce  =1 

-  Quadruplex 
Inputs  are  from  the  regx 


--  Implement  the  following  function  for  256  samples. 

Rij=(pi/(2*n)sum(sign(X)*sign(Y))  l<=n<=256 
Input  is  byte  parallel 


LIBRARY  ieee; 

USE  iece.stdJogic_l  164. ALL; 

LIBRARY  arithmetic; 

USEarithmetic.std_logic_arith.ALL; 

Graphics  Design  Library 

ENTITY  poIcoin_p  IS 

PORT  (SIGNAL  xsignal  ;IN  SIGNED(7  DOWNTO  0); 
read  on  the  rising  edge  of  elk. 

SIGNAL  ysignal  ;IN  SIGNED(7  DOWNTO  0); 
read  on  the  rising  edge  of  elk. 

SIGNAL  elk  :IN  STD_LOGIC; 

SIGNAL  ee  :IN  STD_LOGIC; 

SIGNAL  ares  :IN  STD_LOGIC; 
resets  to  zero, 

SIGNAL  rsignal  :OUT  SIGNED(1 1  DOWNTO  0)); 
eomp  output  with  4  binary  plaees,  ie.  pos  values  look  like; 

2^-1,  2^-2,  2^-3, 2''-4,  2''-5,  2''-6,  2^-1,  2^-8] 

END  ENTITY  poleoin_p; 

ARCHITECTURE  behav  OF  poleoin_p  IS 

SIGNAL  sgnx  :SIGNED(I  DOWNTO  0); 

—sign  of  xsignal. 

SIGNAL  sgny  :SIGNED(1  DOWNTO  0); 

-sign  of  ysignal. 

SIGNAL  sproduct  :SIGNED(3  DOWNTO  0); 

product  of  signs. 

SIGNAL  sextend  :SIGNED(9  DOWNTO  0); 

--sproduet  extended  to  a  10  bit  2s  eomp  value. 

SIGNAL  stotal  :SIGNED(9  DOWNTO  0); 

-aeeumulated  product  of  signs. 

SIGNAL  piproduct  :SIGNED(18  DOWNTO  0); 

-stotal  multipled  by  pi. 

SIGNAL  longrsig  :SIGNED(18  DOWNTO  0); 

piproduct  shifted  to  remove  non  significant  bits. 

BEGIN 

sgny(l)  ysignal(7); 

is  identical  to  the  MSB  of  the  signal. 

sgnx(l)  <=  xsignal(7); 

sgny(O)  <= 'O' WHEN  ysignal  =  "OOOOOOOO”  ELSE  T‘* 

is  zero  when  the  signal  is  zero. 

sgnx(O)  <zr  ‘O'  WHEN  xsignal  =  "00000000"  ELSE  T; 


—Mentor 


-2s  comp  input, 

—2s  comp  input, 

-clock . 

-clock  enable, 
-asynchronous  reset, 

-2s 


-  [0,  2^2,  2,  1  . 


—the  MSB  of  sign 


-the  LSB  of  sign 


finput  <=ysignal  WHEN  crossxy='  I '  ELSE  xsignal; 

ginput  <=xsignal  WHEN  crossxy='  1 '  ELSE  ysignal; 

fextend:  ENTITT  work.twoscomp(behav) 

2s  comp  finpul  to  the  16bit  2s  comp  fsigibbit 

GENERIC  MAP  (iwidth  =>  8,  owidth  =>  16) 
according  to  invertxy. 

PORT  MAP  (input  =>  finput,  zero  =>  'O',  ncg  =>  invertxy, 
output  =>  fsigl6bit); 

gextend:  ENTITY  work.twoscomp(behav) 

2s  comp  ginput  to  the  16bit  2s  comp  gsigl6bit 

GENERIC  MAP  (iwidth  =>  8,  owidth  =>  16) 
according  to  invertxy. 

PORT  MAP  (input  =>  ginput,  zero  =>  'O',  neg  =>  invertxy, 
output  =>gsigl6bit): 

faccum:  ENTITY  work.accums(behav) 
fsigl6bit  and  output  the  16bit  2scomp  ftotal. 

GENERIC  MAP  (width  =>  16) 

PORT  MAP  (input  =>  fsigl6bit,  output  =>  ftotal,  elk  =>  elk, 
ce  =>ce,  ares  =>  ares); 


-Extend  the  8bit 
-and  invert 


-Extend  the  8bit 
—and  invert 


-Accumulate  the 


gaccum:  ENTITY  work.accums(behav) 
gsigI6bit  and  output  the  16bit  2scomp  gtotal. 

GENERIC  MAP  (width  =>16) 

PORT  MAP  (input  =>  gsig  1 6bit,  output  =>  gtotal,  elk  =>  elk, 
ce  =>ce,  ares  =>  ares); 


“Accumulate  the 


fgproduct  <=  ftotal  *  gtotal; 

products. 

piproduct  <=  fgproduct  *  "01 1001001 "; 

to  6  places;  pi  is  estimated  by  1 1.001001  base2. 

longrsig  <=  piproduct  /  "01 0000000000000000000"  ■ 
than[((N*N*2)*(2'^6)],  this  will  leave  4 

rsignal  <=  longrsig(19  DOWNTO  0); 

(2''0)  place.  Result  is  "[19:4].[3:0]". 


-Perform  the  final 

—pi  *2^6  rounded 
-Divide  by  16  less 

—places  after  the 


END  ARCHITECTURE  behav; 


LIBRARY  iece; 

USE  ieee.std_logic_1 1 64. ALL; 

LIBRARY  arithmetic; 

USE  arithmetic.,std_logic_arith.ALL; 

ENTITY  accums  IS 
GENERIC 

(width:  POSITIVE  := 1 6); 

PORT 

(SIGNAL  input  :IN  SIGNED(width-I  DOWNTO  0); 
SIGNAL  output  rINOUT  SIGNED(width- 1  DOWNTO  0)  • 
SIGNAL  elk  :IN  STD_LOGIC; 

SIGNAL  ce  :IN  STD_LOGIC; 

SIGNAL  ares  ;IN  STD_LOGIC); 

END  ENTITY  accums; 

ARCHITECTURE  behav  OF  accums  IS 

SIGNAL  summation  :SIGNED(width- 1  DOWNTO  0); 

BEGIN 

summation  <=  output  +  input; 

regis:  ENTITY  work.regs_pea(behav) 

GENERIC  MAP  (width  =>  width) 

PORT  MAP  (q  =>  output,  d  =>  summation,  elk  =>  elk, 
ce  =>ce,  ares  =>  ares); 


END  ARCHITECTURE  behav; 


LIBRARY  ieee; 

USE  ieee.stdJogic_l  164.ALL; 


LIBRARY  arithmelic; 

USE  arilhmelic.sid_logic_arith.ALL; 


ENTITY  dff_pea  IS 


PORT 

(SIGNAL  q 
SIGNAL  d 
SIGNAL  elk 
SIGNAL  ce 
SIGNAL  ares 
END  ENTITY  dff_pea; 


:INOUTSTD_LOGIC; 
:IN  STD_LOGIC; 
;IN  STD_LOGIC; 
:IN  STD_LOGIC; 
:IN  STD_LOGIC); 


ARCHITECTURE  behav  OF  dff_pea  IS 
BEGIN 


reg:  PROCESS  (ares,  elk)  IS 

BEGIN 

IF  ares  =  T'  THEN 
q  <=  'O’; 

ELSIF  ares  ='0’  and  (clk'event  and  elk  ='!')  THEN 
IFce=T’THENq<=d; 

ELSE  q<=  q; 

END  IF; 

END  IF; 

END  PROCESS; 


END  ARCHITECTURE  behav; 


LIBRARY  ieee; 

USE  ieee.std_logic_l  I64.ALL; 
-USE  ieee.s(d_Iogic_arith.ALL; 
-USE  ieee.sldJogic_unsigned.ALL; 


LIBRARY  arithmetic: 

USE  arithmelic.std_Iogic_arith.ALL; 

ENTITY  regs_pea  IS 
GENERIC 

(width:  POSITIVE  ;=  16); 

PORT 


(SIGNAL  q 

;INOUTSIGNED(  width- 1  DOWNTOO); 

SIGNAL  d 

;IN 

SIGNED(  width- 1  DOWNTOO); 

SIGNAL  elk 

:IN 

STD_LOGIC; 

SIGNAL  ce 

;IN 

STD_LOGIC; 

SIGNAL  ares 
END  ENTITY  regs_pea; 

.TN 

STD_LOGIC); 

ARCHITECTURE  behav  OF  regs_pea  IS 

BEGIN 

reg:  PROCESS  (ares,  elk)  IS 

BEGIN 

IF  ares  =  T'  THEN 

q  <=  sign_extend("C)0",  width); 
ELSIF  ares  ='0'  THEN 
IF  (clk'event  and  elk  =T')  THEN 

IFce=T'THENq<=d; 

ELSE  q<=  q; 

END  IF; 

ELSE  q<=  q; 

END  IF; 

END  IF; 

END  PROCESS; 


END  ARCHITECTURE  behav; 


LIBRARY  icee; 

USE  icee.sid_Iogic_l  164.ALL; 

LIBRARY  arithmetic; 

USE  arithmetic. stdJogic^arith.ALL; 

ENTITY  twoscomp  IS 

GENERIC  (iwidth;  POSITIVE  :=8; 
owidth:  POSITIVE  := 1 6); 

PORT  (SIGNAL  input  :IN  SIGNED(iwidth-I  DOWNTO  0); 

SIGNAL  zero  ;IN  STD_LOGIC; 

SIGNAL  neg  :IN  STD_LOGIC; 

SIGNAL  output  :OUT  SIGNED(owidth- 1  DOWNTO  0)); 

END  ENTITY  twoscomp; 


ARCHITECTURE  behav  OF  twoscomp  IS 
BEGIN 

ext:  output  <=  sign_extend("00",  owidth)  WHEN  zero=T' ELSE 

sign_extend(-input.  owidth)  WHEN  zero='0'  AND  neg='  1'  ELSE 
sign_extend(input,  owidth) ;  -  WHEN  zero='0'  AND  neg=’0'; 

END  ARCHITECTURE  behav; 


LIBRARY  icec; 

USE  ieee.s(dJogic_1 1 64. all; 
USE  ieee.std_logic_ariih.all; 


ENTITY  correlator!  IS 

PORT  (  X,  Y  ;  IN  std_logic_vector(7  DOWNTO  0); 

elk, reset ;  IN  stdjogic; 


ROUT 

R 

multxy 

Xabs.Yabs 

count 

OVR 

END  correlator!; 

ARCHITECTURE  structural  of  correlator!  IS 


:  OUT  stdJogic_vector(31  DOWNTO  0); 

:  OUT  std_logic_vector(3 1  DOWNTO 
:  OUT  std_logic_vector(15  DOWNTO  0); 

;  OUT  std_logic_vector(7  DOWNTO  0); 

:  OUT  INTEGER; 

:  OUT  std_logic); 


FUNCTION  to_int(a:std_logic_vector(31  DOWNTO  0))  RETURN  integer  IS 
variable  val;  integer:=  0; 
variable  b:  integer;=  I ; 

variable  tmp  ;  std_logic_vector(3I  DOWNTO  0); 
variable  neg_sign,C :  std_logic; 

BEGIN 

IF(a(31)  =  ’I')  THEN 
neg_sign  :=  'I'; 
tmp  ;=  (NOT  ( a )); 

C;='l'; 

FOR  i  IN  0  TO  31  LOOP 

tmp(i)  ;=  tmp(i)  XOR  C; 

C  :=  tmp(i)  AND  C; 

END  LOOP; 

ELSE 

neg_sign  :=  'O'; 
tmp  :=  a; 

END  IF; 

FOR  i  IN  0  TO  31  LOOP 

if  (tmp(i)  = '!')  then 

val  :=  val  +  b; 

end  if; 
b  :=  b  2; 

END  LOOP; 


IF  (neg_sign  = '!')  THEN 
val  ;=  -val; 

END  IF; 
return  val; 

END  to_int; 


COMPONENT  abs8 
PORT(  a8 


o8 


:IN  stdJogic_vector(7  DOWNTO  0); 

:OUT  std_logic_vector(7  DOWNTO  0)); 


sproduci  <=  sgnx*$gny; 

sextend  <=  sign_extcnd(sproduci,  10); 


saccum:  ENTITY  work.accums(behav) 
mulliplied  signed  values  and  output  the 
GENERIC  MAP  (width  =>  10) 

stotal. 

PORT  MAP  (input  =>  sextend,  output  =>  stotal,  elk  =>  elk, 
ce  =>ce,  ares  =>  ares); 

piproduct  <=  stotal  *  "01 1001001"; 

places;  pi  is  estimated  by  11.001001  base2. 

longrsig  <=  piproduct  /  "010000000"; 
fhan[((N*2)*(2^6)],  this  will  leave  8 

rsignal  <=  Iongrsig(l  1  DOWNTO  0); 

(I'^O)  place.  Result  is  "[1 1:8]. [7:0]". 


-Accumulate  the 
-lObit  2scomp 


-pi*2''6  rounded  to  6 
-Divide  by  256  less 

—places  after  the 


END  ARCHITECTURE  behav; 


5.1.2  Hybrid  Sign  Correlation  Estimator 


The  following  is  the  VHDL  code  for  the  hybrid  sign  estimator.  Subroutines  for  this 
estimator  are  at  the  end  of  this  section.  The  hybrid  sign  estimator  requires  more 
computation  than  the  polarity  coincidence  but  fewer  than  the  quadruplex.  The  accuracy  is 
also  a  compromise  between  the  polarity  coincidence  and  the  quadruplex  estimators. 
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Implement  the  following  function  for  256  samples. 
R=  PPRl*R2/2 

Rl-(l/n)*SUM(IYnl)  l<=n<=256 

R2=(l/n)*SUM(Xn*sign(Yn))  l<=n<=256 


LIBRARY  ieee; 

USE  ieee.stdJogic_1 164. ALL; 

LIBRARY  arithmetic: 
USEarithmetic.stdJogic_arith.ALL: 
Graphics  Design  Library 

ENTITY  correlate2  IS 


PORT  (SIGNAL  xsignal  :IN  SIGNED(7  DOWNTO  0)- 
read  on  the  rising  edge  of  elk. 

SIGNAL  ysignal  :IN  SIGNED(7  DOWNTO  0)- 
read  on  the  rising  edge  of  elk. 

SIGNAL  elk  :IN  STD_LOGIC; 

SIGNAL  ce  .IN  STD_LOGIC; 

SIGNAL  ares  .TN  STD_LOGIC; 
resets  to  zero. 

SIGNAL  rsignal  ;OUT  SIGNED(19  DOWNTO  0)); 
comp  output  with  4  binary  places,  ie.  pos  values  look  like: 


2*12,  2*1 1,  2*10,  2*9,  2*8,  2*7,  2*6,  2*5, 
END  ENTITY  correIate2; 

2,1 . 2*-l,  2*-2,  2*-3,  2*-4]. 

ARCHITECTURE  behav  OF  correlate2  IS 


SIGNAL 

SIGNAL 

SIGNAL 

SIGNAL 

SIGNAL 

SIGNAL 

SIGNAL 

SIGNAL 


ysigl6bit  :SIGNED(I5  DOWNTO  0): 

xsig  1 6bit  :SIGNED(  1 5  DOWNTO  0) 

ytotal  ;SIGNED(15  DOWNTO  0) 

xtotal  :SIGNED(15  DOWNTO  0) 

xyproduct  :SIGNED(31  DOWNTO  0) 

piproduct  :SIGNED(40  DOWNTO  0) 

longrsig  :SIGNED(40  DOWNTO  0); 
yzero  :STD_LOGIC; 


BEGIN 

yzero  <=  T '  WHEN  ysignal="00000000"  ELSE  'O'- 

yzero]  determine  sign(Yn);  [d,l]=0,  (1,0]=-1,  [0,0]= 1. 


yextend:  ENTITY  work.twoscomp(behav) 

2s  comp  ysignal  to  the  16bit  2s  comp  ysigl6bit. 

GENERIC  MAP  (iwidth  =>  8,  o width  =>  16) 

PORT  MAP  (input  =>  ysignal.  zero  =>  'O',  neg  =>  ysignal(7). 


“-2s  comp  input, 

—2s  comp  input, 

-clock . 

-clock  enable, 
-asynchronous  reset, 

-2s 

-  [0,  2''14,  2'^13, 

-  2M,  2'^3,  2^2, 


-[ysignal(7), 


-Extend  the  8bit 


oulpul  =>  ysigI6bil); 


xexlend:  ENTITY  work.twoscoinp(behav) 

8bil  2s  comp  xsignal  by  sign(Yn)  and  extend  to  the 

GENERIC  MAP  (iwidth  =>  8,  owidth  =>  16) 
xsigI6bit. 

PORT  MAP  (input  =>  xsignal,  zero  =>  yzero,  neg  =>  ysignal(7), 

output  =>xsig  1 6bit); 

ysum:  ENTITY  work.accums(behav) 

Accumulate  the  ysigl6bit  and  output  the  I6bit  2scomp  y total. 

GENERIC  MAP  (width  =>  16) 

PORT  MAP  (input  =>  ysigl6bit,  output  =>  ytotal,  elk  =>  elk, 
ce  =>ce,  ares  =>  ares); 


-Multiply  the 
-16bit  2s  comp 


xsum:  ENTITY  work.accums(behav) 

Accumulate  the  xsigl6bit  and  output  the  16bit  2scomp  xtotal. 

GENERIC  MAP  (width  =>  16) 

PORT  MAP  (input  =>  xsig  1 6bit,  output  =>  xtotal,  elk  =>  elk, 

ce  =>ce,  ares  =>  ares); 

xyproduct  <=  ytotal  *  xtotal; 

products. 

piproduct  <=  xyproduct  *  "01 1001001 "; 

to  6  places;  pi  is  estimated  by  1 1.001001  base2. 

longrsig  <=  piproduct  /  "OlOOOOOOOOOOOOOOOOOOO"- 
than[((N*N*2)*(2^6)],  this  will  leave  4 

rsignal  <=  longrsig(  1 9  DOWNTO  0); 

(2^0)  place.  Result  is  "[19:4]. [3:0]". 


-Perform  the  final 

--pi*2^6  rounded 
-Divide  by  16  less 

-places  after  the 


END  ARCHITECTURE  behav; 


5.1.3  Quadruplex  (Bit  Serial)  Correlation  Estimator 

The  following  is  the  VHDL  code  for  the  quadroplex  estimator.  The  quadruplex  is  the 
most  accurate  nonlinear  estimator  while  having  less  computational  burden  than  the 
conventional  estimator. 
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-  Implement  the  following  function  for  256  samples. 

Rij=(pi/(2*n''2)sum(Fn)sum(Gn)  l<=n<=256 
Input  is  byte  serial 


LIBRARY  ieee; 

USE  ieee.std_logic_l  164. ALL; 

LIBRARY  arithmetic; 

USE  arithmetic.stdJogic_arith.ALL; 
Graphics  Design  Library 

ENTITY  quadrx_s  IS 


PORT  (SIGNAL  xysig  :IN  SIGNED(7  DOWNTO  0); 
read  on  the  rising  edge  of  elk. 

SIGNAL  elk  :IN  STD_LOGIC; 

SIGNAL  ce  ;IN  STD_LOGIC; 

SIGNAL  ares  :IN  STD_LOGIC; 
resets  to  zero. 

SIGNAL  rsignal  :OUT  SIGNED(I9  DOWNTO  0)); 
comp  output  with  4  binary  places,  ie.  pos  values  look  like: 


"2s  comp  input, 


—clock . 
—clock  enable, 
-asynchronous  reset, 

-2s 


2''12,  2''!  1,  2^10,  2''9,  2''8,  2'^?,  2^6, 2''5, 
END  ENTITY  quadrx.s; 

2,1 .2''-I,2''-2,  2M,  2M]. 


--  [0,2^14,  2^13, 
--  2M,  2^3,  2''2, 


ARCHITECTURE  behav  OF  quadrx_s  IS 


SIGNAL  xinput  :SIGNED(7  DOWNTO  0); 

--  Intermediate  signals  between  the  input  buffer  and  the  quadrx!_p 
SIGNAL  yinput  :SIGNED(7  DOWNTO  0); 

-  component. 

SIGNAL  clk2  :STD_LOGIC: 

Divide  by  2  clock,  transitions  occur  on  the  rising  edge  of  elk. 

SIGNAL  clk2inv  :STD_LOGIC; 

Inverse  of  clk2. 

SIGNAL  enclk2  :STD_LOGIC; 

enable  of  the  input  buffer  register  that  holds  the  y  value. 

SIGNAL  enclk2inv  :STD_LOGIC; 

enable  of  the  input  buffer  register  that  holds  the  x  value 
BEGIN 

clk2inv  <=  NOT  clk2; 

enclk2  <=  ce  AND  clk2; 

enclk2inv  <=  ce  AND  clk2inv; 


clockhalf:  ENTITY  work.dff_pea(behav) 

PORT  MAP  (q=>clk2,  d=>clk2inv,  clk=>clk,  ares=>ares 
output  is  clk2  and  clk2inv. 


,  ce=>ce); 


-  Clock 

-  Clock 


-  Clock  divider, 


regx:  ENTITY  work.regs_pea(bchav) 
register  that  latches  current  value  of  x. 

GENERIC  MAP  (width  =>  8) 
enter  xysig,  alternating  in  a  byte  serial 

PORT  MAP  (q  =>  xinput,  d  =>  xysig, 
the  X  value  on  the  rising  edge  of 

ares  =>  ares); 

and  clk2inv  =  I  (clk2  =0). 


--  Input  buffer 
X  and  y  values 

— >  elk,  ce  — >  enclk2inv,--  fashion.  Regx  latches 

—  elk  when  ce  =1 


regy:  ENTITY  work.regs_pea(behav) 
register  that  latches  current  value  of  y. 

GENERIC  MAP  (width  =>  8) 
enter  xysig,  alternating  in  a  byte  serial 

PORT  MAP  (q  =>  yinput,  d  =>  xysig,  elk  =>  elk,  ee  =>  enelk2, 
the  y  value  on  the  rising  edge  o 

ares  =>  ares); 

and  clk2  =  1, 


-  Input  buffer 

-  X  and  y  values 
-  fashion.  Regy  latches 

-  elk  when  ce  =1 


qrx:  ENTITY  work.quadrx_p(behav) 
estimator,  clocked  on  the  rising  edge  of  clk2. 

PORT  MAP  (xsignal  =>  xinput,  ysignal  =>  yinput,  elk  =>  clk2, 
and  regy  components. 

ce  =>  ce,  ares  =>  ares,  rsignal=>  rsignal); 


Quadruplex 

Inputs  are  from  the  regx 


END  ARCHITECTURE  behav; 


-  Implement  the  following  function  for  256  samples.  - 
Rij=(pi/(2*n''2)sum(Fn)sum(Gn)  l<=n<=256 
Input  is  byte  parallel 


LIBRARY  ieee; 

USE  ieee.stdJogic_l  164. ALL; 

LIBRARY  arithmetic; 

USE  arithmetic.stdJogic_arith.ALL; 

Graphics  Design  Library 

ENTITY  quadrx_p  IS 

PORT  (SIGNAL  xsignal  :IN  SIGNED(7  DOWNTO  0); 
read  on  the  rising  edge  of  elk. 

SIGNAL  ysignal  :IN  SIGNED(7  DOWNTO  0); 
read  on  the  rising  edge  of  elk. 

SIGNAL  elk  :IN  STD_LOGIC; 

SIGNAL  ce  :IN  STD_LOGIC; 

SIGNAL  ares  :IN  STD_LOGIC; 
resets  to  zero. 

SIGNAL  rsignal  ;OUT  SIGNED(I9  DOWNTO  0)); 
comp  output  with  4  binary  places,  ie.  pos  values  look  like: 

2''12,  2^1 1,  2^10,  2^9,  2^8,  2^7,  2''6,  2''5, 

END  ENTITY  quadrx_p; 

2,1 .2^-1,  2^-2,  2M,  2M]. 

ARCHITECTURE  behav  OF  quadrx_p  IS 


—Mentor 


-2s  comp  input, 
—2s  comp  input. 


-clock . 
—clock  enable, 
-asynchronous  reset, 

-2s 


-  [0,2^14,  2'^13, 

-  2M,  2^3,  2^2, 


SIGNAL 

SIGNAL 

SIGNAL 

SIGNAL 

SIGNAL 

SIGNAL 

SIGNAL 

SIGNAL 

SIGNAL 

SIGNAL 

SIGNAL 

BEGIN 


finput  :SIGNED(7  DOWNTO  0); 

ginput  :SIGNED(7  DOWNTO  0); 

fsigl6bit:SIGNED(I5  DOWNTO  0); 
gsigl6bit  ;SIGNED(15  DOWNTO  0); 

ftotal  ;SIGNED(15  DOWNTO  0); 

gtotal  :SIGNED(15  DOWNTO  0); 

fgproduct  :SIGNED(31  DOWNTO  0); 

piproduct  ;SIGNED(40  DOWNTO  0); 

longrsig  :SIGNED(40  DOWNTO  0); 
invertxy  ;STD_LOGIC; 
crossxy  :STD_LOGIC: 


crossxy  <_'  l '  WHEN  abs(xsignal)  >=  abs(ysignal)  ELSE  'O'- 

invertxy  <=(crossxy  AND  xsignal(7))  or  ((NOT  crossxy)  AND  ysignal(7)); 


END  COMPONENT; 


COMPONENT  multS 
PORT(  a8 

END  COMPONENT; 


:IN  std_logic_vecior(7  DOWNTO  0); 
b8  ;IN  std_logic_veclor(7  DOWNTO  0); 
c  1 6  ;OUT  sid_logic_vector( 1 5  DOWNTO  0)); 


COMPONENT  add_sub32 
PORT(  a32 

b32 

add_nsub 

c32 

OVR 

END  COMPONENT; 


;IN  sid_logic_vector(31  DOWNTO  0); 

:IN  sldJogic_vector(31  DOWNTO  0); 
;1N  std_logic; 

:OUT  std_Iogic_vector(3 1  DOWNTO  0); 
:OUT  stdjogic); 


COMPONENT  reg32 

PORT  (  elk,  reset  ;IN  std_Iogic; 

D 
Q 

END  COMPONENT; 


:IN  stdJogic_vector(31  DOWNTO  0); 
:OUT  stdJogic_vector(31  DOWNTO  0)); 


CONSTANT N 
CONSTANT  sign_ext 
"0000000000000000"; 

SIGNAL  X_tmp,Y_tmp 

SIGNAL  mtmpla 

SIGNAL  mtmplb,  mtmp2 

SIGNAL  acetmp  1 ,  accmtmp2,  accmtmp3  :std. 

SIGNAL  itmpl,itmp2,itmp3 

SIGNAL  sign_out 

SIGNAL  countin,  countout 


:INTEGER  :=  256; 
:stdJogic_vector(15  DOWNTO  0):= 

:std_logic_vector(7  DOWNTO  0); 

:std_logic_vector(15  DOWNTO  0); 
:stdJogic_vector(31  DOWNTO  0); 
.Iogic_vector(31  DOWNTO  0); 

INTEGER; 

:std_logic; 

:INTEGER  :=  0; 


BEGIN 


sign_out  <=  NOT(X(7)  XOR  Y(7)); 
absl:  abs8  PORTMAP(X,  X_tmp); 
Xabs  <=  X_tmp; 

abs2:  abs8  PORTMAP(Y,  Y_tmp); 
Yabs  <=  Y_tmp; 


multrl;  mult8  PORT  MAP  (X_tmp,  Y_tmp,  mtmpla); 

multxy  <=  mtmpla; 

mtmplb  <=  sign_ext  &  mtmpla; 

addrl:  add  sub32  PORT  MAP  (accmtmp3,  mtmplb,  sign_out,  accmtmp2,  OVR); 
regl:  reg32  PORT  MAP(cIk,  reset,  accmtmp2,  accmtmpS); 
itmpl  <=  to_int(accmtmp3); 
itmp2  <=  itmpl  /N; 


mtmp2  <=  CONV_STD_LOGIC_VECTOR(itmp2,32); 

R  <=  accmtmp3; 

ROUT  <=  mtmp2; 


cnt:PROCESS(clk,reset, countin) 

VARIABLE  countv:  INTEGER- 

BEGIN 

countv  :=  countin; 
if(reset  =  'l')  then 


countv  :=  0; 

elsif  (clk'EVENT  and  elk  =  T)  THEN 
countv  :=  countv  +  1; 

end  if; 

countout  <=  countv; 

END  PROCESS  ent; 

countin  <=  countout; 
count  <=  countout; 


END  structural; 


LIBRARY  ieee; 

USE  iece.stdJogic_l  1 64. all; 

USE  ieee.stdJogic_arith.a!l; 

-hybrid  sign 

ENTITY  correlator2  IS 

PORT(  X.Y  .-INstd 

elk, reset 
Y_OVR,  X 
count 
RY,RX 
ROUT 

END  correlator2; 

ARCHITECTURE  structural  of  correlator2  IS 


logic_vector(7  DOWNTO  0); 

:  IN  stdjogic; 

;  OUT  std_logic; 

:  OUT  INTEGER; 

:  OUT  stdJogic_vector(15  DOWNTO  0); 

;  OUT  stdJogic_vector(3 1  DOWNTO  0)); 


FUNCTION  toJnt(a:stdJogic_vector(15  DOWNTO  0))  RETURN  integer  IS 
variable  val;  integer:=  0; 
variable  b:  integer:=  1 ; 

variable  tap  :  std_logic_vector(15  DOWNTO  0); 
variable  neg_sign,C :  stdjogic; 

BEGIN 

IF  (a(15)  =  T’)  THEN 
neg_sign  :=  'T; 
tap  :=  (NOT  (  a )); 

C:=T’: 

FOR  i  IN  0  TO  15  LOOP 

tap(i)  tmp(i)  XOR  C; 

C  :=  tmp(i)  AND  C; 

END  LOOP; 

ELSE 

neg_sign  :=  'O'; 
tap  :=  a; 

END  IF; 


FOR  i  IN  0  TO  15  LOOP 

if  (tmp(i)  = '!')  then 

val  ;=  val  +  b; 

end  if; 
b  :=  b  2; 

END  LOOP; 


IF  (neg_sign  =  '!')  raEN 
val  :=  -val; 

END  IF; 


return  val; 

END  tojnt; 

IN  std_logic_vector(7  DOWNTO  0); 

:OUT  std_logic_vector(7  DOWNTO  0)); 


COMPONENT  abs8 
PORT(  a8 

o8 


END  COMPONENT; 


COMPONENT  mult  16 

PORT(  al6  ■■INstdJogic_vector(15  DOWNTOO); 

bl6  :IN  sldjogic_vector(15  DOWNTO  0); 
c32  ;OUTstdJogic_vector(3i  DOWNTOO))- 
END  COMPONENT;  ^  ' 


COMPONENT  add_sub  16 
PORT(  al6 


END  COMPONENT; 

COMPONENT  reg  16 
PORT  (  elk,  reset 


bl6 

add_nsub 

cl6 

OVR 


:IN  stdJogic_vector(15  DOWNTO  0); 

:IN  stdJogic_vector(15  DOWNTO  0); 
:IN  std_logic; 

:OUT  std_logic_vector(15  DOWNTO  0); 
:OUT  std_logic); 


END  COMPONENT; 


D 

Q 


:IN  std_Iogic; 


:IN  std_logic_vector(15  DOWNTO  0); 
:OUT  stdJogic_vector(15  DOWNTO  0)); 


CONSTANT N 
CONSTANT  pi 

DOWNTO  0):="00000001 1001 1001"; 

CONSTANT  pLshft 
CONSTANT  sign_ext 
0):=  "00000000"; 

SIGNAL  Y_tmpl,  X_tmpl,  X_tmp 
SIGNAL  Y_tmp2,X_Imp2,X_tmp3,Y_tmp3  :std. 
SIGNAL  Y.accum  1 ,  Y_accum2,  RY_tmp  :std' 
SIGNAL  X_accum  I ,  X_accum2,  RXtmp  -std 
SIGNAL  Y_as  ’  ' 

SIGNAL  X_as 

SIGNAL  itmp  1  ,itmp2,itmp3,itmp4,i(mp5,itmp6 
SIGNAL  countin,  countout 
SIGNAL  RX_tmp_pi  1 
SIGNAL  RX_tmp_pi2,tmp_pi 


:INTEGER  :=  256; 
;std_logic_vector(15 

;INTEGER  :=  128; 
:std_logic_vector(7  DOWNTO 

:std_Iogic_vector(7  DOWNTO  0); 
.logic_vector(15  DOWNTO  0); 

Jogic_vector(15  DOWNTO  0); 

.logic_vector(15  DOWNTO  0); 

:std_Iogic; 

rstdjogic  :=  T; 
:INTEGER; 

:INTEGER  :=  0; 

:std_logic_vector(31  DOWNTO  0); 
;stdJogic_vector(15  DOWNTO  0); 


BEGIN 


—Calculation  of  R1 
Y_as  <=■!'; 

absl:  abs8  PORTMAP(Y,  Y_tmpl); 

Y_tmp2  <=  sign_ext  &  Y_tmpl; 

addrl:  add_subl6  PORT  MAP  (Y_accum2,Y_tmp2,Y_as,Y_accuml,Y_OVR)- 
regl:  regl6  PORT MAP(clk,  reset,  Y_accuml,  Y_accum2)- 
itmpl  <=  toJnt(Y_accum2); 
itmp2  <=  itmpl  /  N; 

Y_tmp3  <=  CONV_STD_LOGIC_VECTOR(itmp2  16)- 
RY  <=  Y_accum2; 

RY_tmp  <=  Y_tmp3; 


—Calculation  of  R2 

abs2:  abs8  PORT  MAP(X,  X_tmp); 


sign_y:  PROCESS(Y) 

BEGIN 

if  (Y  =  sign_ext)  then 

X_tmpl  <=  sign_ext; 

else 

X_as  <=  NOT(Y(7)  XOR  X(7)); 

X_tnipl<=  X__imp; 

end  if; 

END  PROCESS  sign_y; 

X_tmp2  <=  sign^ext  &  X_tmpl; 

addr2:  add_subl6  PORT  MAP  (X_accum2,X_tmp2,X_as,X_accumI,X_OVR) 
reg2:  reg  1 6  PORT  MAP(cIk,  reset,  X.accum  I ,  X_accum2); 
itmp3  <=  to_int(X_accum2); 
itmp4<=  itmp3/N: 

X_tmp3  <=  CONV_STD_LOGIC_VECTOR(itmp4,16)- 
RX  <=  X_accum2; 

RX_tmp  <=  X_tinp3; 

cnt:PROCESS(clk, reset, countin) 

VARIABLE  countv:  INTEGER; 

BEGIN 

countv  :=  countin; 
if  (reset  =  T’)  then 

countv  :=  0; 

elsif  (clk'EVENT  and  elk  =  T')  THEN 
countv  :=  countv  +  1 ; 

end  if; 

countout  <=  countv; 

END  PROCESS  ent; 

countin  <=  countout; 
count  <=  countout; 

--tnip_pi  <=  pi; 

-multpi:  multie  PORT  MAP  (RX_tmp.tmp_pi,RX_tmp_pil); 

-itmp5  <=  toJnt(RX_tmp_pi  1(15  DOWNTO  0)); 

-itmp6<=  itmp5/  pi_shft; 

-RX_tmp_pi2  <=  CONV_STD_LOGIC_VECTOR(itnip6,16); 
multi:  multl6  PORT  MAP  (RX_tmp,RY_tmp,ROUT); 


END  structural; 


LIBRARY  icee; 

USE  ieee.sldJogic_l  164.all; 

USE  ieee.sldJogic_arilh.all; 

-polarity  coin  serial 

ENTITY  correlators  IS 

PORT(  X,Y  ;INsid. 

elk,  reset 
OVR 
count 
ROUT 

END  correlators; 

ARCHITECTURE  structural  of  correlators  IS 


logic_vector(7  DOWNTO  0); 

:  IN  sld_logic; 

;  OUT  std_logic; 

:  OUT  INTEGER; 

:  OUT  stdJogic_vector(15  DOWNTO  0)); 


FUNCTION  toJnt(a:std_Iogic_vector(15  DOWNTO  0))  RETURN  integer  IS 
variable  val:  integer:=  0; 
variable  b;  integer:=  1 ; 

variable  tmp  :  stdJogic_vector(I5  DOWNTO  0); 
variable  neg_sign,C :  std_logic; 

BEGIN 

IF  (a(15)  =  T')  THEN 
neg_sign  :=  T’; 
tmp  :=  (NOT  ( a )); 

C:='r; 

FOR  i  IN  0  TO  15  LOOP 

tmp(i)  :=  tmp(i)  XOR  C; 

C  :=  tmp(i)  AND  C; 

END  LOOP; 

ELSE 

neg_sign  :=  'O'; 
tmp  :=  a; 

END  IF; 


FOR  i  IN  0  TO  15  LOOP 

if(tmp(i)  =  ’l')then 

val  :=  val  +  b; 

end  if; 
b  :=  b  2; 

END  LOOP; 

IF  (neg_sign  =  '1')  THEN 
val  :=  -val; 

END  IF; 
return  val; 

END  to_int; 


IN  stdJogic_vector(15  DOWNTO  0); 

:IN  stdJogic_vector(I5  DOWNTO  0); 
IN  std_logic; 

■OUT  stdJogic_vector(15  DOWNTO  0); 


COMPONENT  add_sub  16 
PORT(  al6 

bl6 

add_nsub 

cl6 


END  COMPONENT; 


OVR 


;OUT  sld_logic); 


COMPONENT  reg  16 

PORT(  elk,  reset  .INstdJogic; 

D 

Q 

END  COMPONENT; 


:1N  std_logic_vector(l5  DOWNTO  0); 
:OUT  stdJogic_vector(I5  DOWNTO  0)); 


:INTEGER  :=  256; 
:std_logic_veclor(7  DOWNTO  0) 

:std_Iogic_vector(7 

;std_Iogic_vector(I5  DOWNTO  0); 
:sld_logic; 

INTEGER; 

std_logic_vector(15  DOWNTO  0); 

••INTEGER; 

BEGIN 

sign_mult:  PROCESS(Y,X) 

BEGIN 

if  (Y  =  sign_ex0  OR  (X  =  sign_ext)  then 
R_tmp  <=  sign_ext  &  sign_ext; 
as  <=  T; 

elsif  (Y(7)  =  'T)  AND  (X(7)  =  ']')  then 
R_tmp  <=  ONE; 
as<=T'; 

elsif  (Y(7)  =  '0')  AND  (X(7)  =  '0')  then 
R_tmp  <=  ONE; 
as<=T'; 

else 

R_tmp  <=  ONE; 
as  <=  'O'; 

end  if; 

END  PROCESS  sign.mult; 


CONSTANT  N 
CONSTANT  sign_ext 
;=  "00000000"; 

CONSTANT  ONE 

DOWNTO  0)  :=  "0000000000000001"; 
SIGNAL  R_tmp 
SIGNAL  as 

SIGNAL  countout,  countin 
SIGNAL  accuml,accum2,tmp 
SIGNAL  itmpl,  itmp2 


addrl:  add_subl6  PORT  MAP  (accum2,R_tmp,as,accuml,OVR)- 
regl:  regl6  PORT  MAP(cIk,  reset,  accuml,  accum2); 
itmpl  <=  to_int(accum2); 
itmp2 <=  itmpl  / N; 

tmp  <=  CONV_STD_LOGIC_VECTOR(itmp2,16)- 

ROUT<=tmp;  ^  V  ,  J, 


cnt;PROCESS(cIk,reset,countin) 

VARIABLE  countv:  INTEGER- 

BEGIN 

countv  ;=  countin; 
if  (reset  =  '!')  then 

countv  :=  0; 

elsif  (clk’EVENT  and  clk=  T)  THEN 
countv  countv  +  1 ; 


end  if; 


countout  <=  countv 
END  PROCESS  cnt; 

countin  <=  countout; 
count  <=  countout; 


END  structural; 


LIBRARY  ieee; 

USE  ieee.stdJogic_1 164.all; 
USE  ieee.stdJogic_ariih.aIl; 

-polarity  coin  parallel 

ENTITY  correlators  IS 
PORT(  XI,  YI 


END  correlators; 

ARCHITECTURE  structural  of  correlators  IS 

FUNCTION  to_int(a:std_iogic_vector(15  DOWNTO  0))  RETURN  integer  IS 
variable  val:  inleger;=  0; 
variable  b:  integer—  1 ; 

variable  tmp  :  std_logic_vector(15  DOWNTO  0); 
variable  neg_sign,C :  stdjogic; 

BEGIN 

IF(a(I5)  =  'l')THEN 
neg_sign  := 
tmp  :=  (NOT  ( a )); 

C-T; 

FOR  i  IN  0  TO  15  LOOP 

tmp(i)  ;=  tmp(i)  XOR  C; 

C  :=  tmp(i)  AND  C; 

END  LOOP; 

ELSE 

neg_sign  :=  'O'; 
tmp  :=  a; 

END  IF; 

FOR  i  IN  0  TO  15  LOOP 

if  (tmp(i)  =  '1')  then 

val  :=  val  +  b; 

end  if; 
b  :=  b  ♦  2; 

END  LOOP; 

IF  (neg_sign  =  '!')  -raEN 
val  :=  -val; 

END  IF; 
return  val; 

END  tojnt; 


IN  stdJogic_vector(15  DOWNTO  0); 

:IN  stdJogic_vector(15  DOWNTO  0); 
IN  std_logic; 


COMPONENT  add_subl6 
PORT(  al6 

bl6 

add  nsub 


;  IN  sld_logic_vector(7  DOWNTO  0); 

^2.  Y2  ;  IN  std_logic_vector(7  DOWNTO  0); 

elk, reset  ;  IN  std_logic; 

C>VR  :  OUT  stdjogic; 

count  :  OUT  INTEGER; 

:  OUT  std_logic_vector(  15  DOWNTO  0)); 


END  COMPONENT; 


cl6 

OVR 


:OUT  sid_Iogic_vector(l5  DOWNTO  0); 
:OUT  sld_logic); 


COMPONENT  reg  1 6 

PORT  (  elk,  reset  :IN  std_logic; 

D 

Q 

END  COMPONENT; 


:IN  std_logic_vector(15  DOWNTO  0); 
:OUT  sldJogic_vector(I5  DOWNTO  0)); 


CONSTANT  N 
CONSTANT  sign_ext 
;=  "00000000"; 

CONSTANT  sign_ext2 
:=  "0000000000000000"; 

CONSTANT  ONE 

DOWNTO  0)  ;=  "0000000000000001 "; 
SIGNAL  R_tmpl,  R_tmp2 
SIGNAL  as,  asO,  asl,  as2 
SIGNAL  countout,  countin 
SIGNAL  accumO,  accuml,accuni2,tmp 
SIGNAL  itmpl,  itnip2 
SIGNAL  OVRO 


:INTEGER  :=  256; 
:std_logic_vector(7  DOWNTO  0) 

:std_logic_vector(15  DOWNTO  0) 

:std_logic_vector(  1 5 

:std_logic_vector(15  DOWNTO  0); 


:std_logic; 

INTEGER; 

:std_logic_vector(15  DOWNTO  0); 

:INTEGER: 

:std_logic; 


BEGIN 

sign_multl:  PR0CESS(Y1,X1) 

BEGIN 

if  (Y1  =  sign_ext)  OR  (XI  =  sign_ext)  then 
R_tmpl  <=  sign_ext  &  sign_ext; 
asl  <=  T; 

elsif  (Yl(7)  =  ’T)  AND  (Xl(7)  =  '1')  then 
R_tmpl  <=  ONE; 
asl  <=  T; 

elsif  (Yl(7)  =  '0')  AND  (Xl(7)  =  '0')  then 
R_tmpl  <=  ONE; 
asl  <=  T'; 
else 

R_tmpl  <=  ONE; 
asl  <=  'O'; 

end  if; 

END  PROCESS  sign_multl; 


sign_mult2:  PROCESS(Y2,X2) 

BEGIN 

if  (Y2  =  sign_ext)  OR  (X2  =  sign_ext)  then 
R_tmp2  <=  sign_ext  &  sign_ext; 
as2  <= '!'; 

elsif  (Y2(7)  =  '!')  AND  (X2(7)  =  '!')  then 
R_tmp2  <=  ONE; 
as2  <='!'; 

elsif  (Y2(7)  =  '0')  AND  (X2(7)  =  '0')  then 
R_tmp2  <:=  ONE; 
as2<='r; 


else 


R_tmp2  <=  ONE; 


as2  <=  ’0*; 

end  if; 

END  PROCESS  sign_mu!t2; 

asO  <=  NOT(  asl  XOR  as2); 
as  <='!'; 

addrO;  add_subI6  PORT  MAP  (Rjmpl,R_tmp2,asO,accumO,OVRO); 
addrl:  add_subl6  PORT  MAP  (accum2,accumO,as,accuml,OVR); 
regl:  regl6  PORT  MAP(cIk,  reset,  accuml,  accum2); 
itmpl  <=toJnt(accum2); 
itmp2  <=  itmpl  /N; 

imp  <=  CONV_STD_LOGIC_VECTOR(itmp2, 16)- 
ROUT<=tmp: 


cnt:PROCESS(cIk,reset,countin) 

VARIABLE  countv:  INTEGER- 

BEGIN 

countv  :=  countin; 
if  (reset  =  '!')  then 

countv  :=  0; 

elsif  (clk'EVENT  and  elk  =  ’I’)  THEN 
countv  :=  countv  +  I; 

end  if; 

countout  <=  countv; 

END  PROCESS  ent; 

countin  <=  countout; 
count  <=  countout; 


END  structural; 


LIBRARY  icee; 

USE  ieec.sld_logic_l  164. all; 
USE  ieee.sid_logic_arith.all; 

-quad  serial 


ENTITY  correlator4  IS 
PORT  (  X,  Y 


END  correIator4; 


:  IN  signed(7  DOWNTO  0); 
clk,reset  :  IN  std_logic; 


OVR 

-count 

f_tmpl,g_tmpl 

f_out,g_out 

ROUT 


:  OUT  std_logic; 

:  OUT  INTEGER; 

:  OUT  signed(7  DOWNTO  0); 

:  OUT  INTEGER; 

;  OUT  std_logic_vector(3 1  DOWNTO  0)); 


ARCHITECTURE  structural  of  correlator4  IS 


FUNCTION  to_intl6(a:stdJogic_vector(15  DOWNTO  0))  RETURN  integer  IS 
variable  val;  integer:=  0; 
variable  b;  integer;=  I; 

variable  tmp  ;  stdJogic_vector(15  DOWNTO  0); 
variable  neg_sign,C ;  stdjogic; 

BEGIN 

IF(a(15)  =  T’)  THEN 
neg_sign  :=  T’; 
tmp  :=  (NOT  ( a )); 

C:=  T'; 

FOR  i  IN  0  TO  15  LOOP 

tmp(i)  ;=  tmp(i)  XOR  C; 

C:=tmp(i)ANDC: 

END  LOOP; 

ELSE 

neg_sign  :=  'O’; 
tmp  :=  a; 

END  IF; 

FOR  i  IN  0  TO  15  LOOP 

if(tmp(i)  =  'l')then 

val  :=  val  +  b; 

end  if; 
b  :=  b  *  2; 

END  LOOP; 

IF  (neg_sign  =  '1')  THEN 
val  :=  -val; 

END  IF; 
return  val; 

END  to_intl6; 

COMPONENT  multl6 

PORT(  al6  :IN  std_logic_vector(15  DOWNTO  0); 


END  COMPONENT; 


bI6 

c32 


:IN  stdJogic_vector(15  DOWNTOO); 
•OUT  std_logic_vector(3I  DOWNTO  0)); 


COMPONENT  add_subI6 

PORT(  aI6,  bl6;IN  sidJogic_vector(I5  DOWNTO  0); 

add_nsub  :IN  std_!ogic; 


END  COMPONENT; 

COMPONENT  reg  16 

PORT  (  elk,  reset 


cl6 

OVR 


END  COMPONENT; 

COMPONENT  abs  16 
PORT(  al6 

END  COMPONENT; 


D 

Q 


ol6 


:IN  std_logic; 


;OUT  stdJogic_vector(15  DOWNTO  0); 
:OUT  std_logic); 


:IN  stdJogic_vector(15  DOWNTO  0); 
:OUT  stdJogic_vector(15  DOWNTO  0)); 


:IN  stdJogic_vector(I5  DOWNTO  0); 

:OUT  stdJogic_vector(15  DOWNTO  0)); 


CONSTANT N 
SIGNAL  countin,  countout 
SIGNAL  f_accunil,f_accum2,  f_accum3,  f_tmp 
SIGNAL  g_accuml,g_accuni2,  g_accum3,  g_tmp 
SIGNAL  f_accum,  g_accum 
SIGNAL  X_int,  Y_int,  R_int 
SIGNAL  fjnt,  fjntl,  fjnt2,  f_int3  iINTEGER; 

SIGNAL  gjnt,  gjntl,  g_int2,  g_int3  rINTEGER; 

SIGNAL  asf,  asg,  OVRF,  OVRG  :std_logic;  ’ 


:INTEGER  :=  256; 

.•INTEGER  :=  0; 

:stdJogic_vector(I5  DOWNTO  0) 
:std_logic_vector(15  DOWNTO  0) 
:std_logic_vector05  DOWNTO  0) 
.•INTEGER; 


BEGIN 

quad_func:  PROCESS(X,Y) 

VARIABLE  tmpf,  tmpg  ;  signed(7  DOWNTO  0); 

VARIABLE  xin,  xin_n,  yin,  yin_n  :  signed(7  DOWNTO  0); 

VARIABLE  xin_t,  yin_t :  integer- 

BEGIN 

xin  :=  X; 

xin_t  :=  CONV_INTEGER(X); 
xin_n  :=  CONV_SIGNED(-xin_t,8); 

yin  ;=  Y; 

yin_t  :=  CONV_INTEGER(Y); 
yin_n  :=  CONV_SIGNED(-yin_t,8); 

if  ( xin_n  <=  yin  )  AND  ( yin  <=  xin)  then 
tmpf  :=  yin; 
tmpg  xin; 

elsif  ( yin_n  <=  xin  )  AND  ( xin  <=  yin)  then 
tmpf  :=  xin; 
tmpg  ;=  yin; 

elsif  (  xin_n  >=  yin  )  AND  ( yin  >=  xin)  then 
tmpf  ;=  yin_n; 
tmpg  :=  xin_n: 

elsif  (  yin_n  >=  xin  )  AND  (  xin  >=  yin)  then 


tmpf  :=  xin_n; 

Impg  :=  yin_n; 

end  if; 

f_tmpl  <=  tmpf; 
g_tmpl  <=  tmpg; 

Caecum  <=  CONV_STD_LOGIC_VECTOR(lmpf,  16)- 
g_accum  <=  CONV_STD_LOGIC_VECTOR(tmpg,  16); 

END  PROCESS  quad_func; 

asf  <=  NOT(f_accum(I5)); 
asg  <=  NOT(g_accum(15)); 

abs  1 :  abs  1 6  PORT  MAP  (Caecum,  CaccumS); 
abs2;  abs  16  PORT  MAP  (g_accum,  g_accum3); 

addf:  add_subl6  PORT  MAP(Caccum2,  CaccumS,  asf,  Caecum  1,  OVRF); 
addg:  add_subl6  PORT MAP(g_accum2,  g_accum3,  asg,  g_accuml,  OVRG); 

regl:  regl6  PORTMAP(clk,  reset,  Caccuml,  Caccum2); 
reg2:  regl 6  PORT  MAP(clk,  reset,  g_accuml,  g_accum2); 

Cint2  <=  to_intl6(Caccum2); 
g_int2  <=  to_intl6(g_accum2); 

fJnt3<=fJnt2/N; 
gjnt3  <=  g_int2  /  N; 

Cout  <=  fjnt3; 
g_out  <=  gjnt3; 

Ctmp  <=  CONV_STD_LOGIC_VECTOR(Cint3,16); 
g_tmp  <=  CONV_STD_LOGIC_VECTOR(g_int3,16); 

OVR  <=  OVRF  OR  OVRG; 

multi:  multie  PORT  MAP  (Ctmp,g_tmp,ROUT); 


"Cnt:PROCESS(clk,reset,countin) 

VARIABLE  countv:  INTEGER; 
-BEGIN 

countv  :=  countin; 

“if  (reset  =  T)  then 
countv  :=  0; 

-elsif  (clk'EVENT  and  elk  =  '1')  THEN 
countv  ;=  countv  +  1; 

-end  if; 

countout  <=  countv; 

-END  PROCESS  ent; 


—countin  <=  countout; 
—count  <=  countout; 


END  structural; 


LIBRARY  ieec; 

USE  ieee.sldJogic_1 164. all; 
USE  ieee.sld_Iogic_ariih.aIl; 

-quad  parallel 

ENTITY  corre}ator4  IS 
P0RT(  XI,  Y1 


END  correlator4; 

ARCHITECTURE  structural  of  correlator4  IS 

FUNCTION  sign_ext(a:signed(7  DOWNTO  0))  RETURN  signed  IS 
VARIABLE  tmp:  signed(15  DOWNTO  0)- 

BEGIN 

if(a(7)  =  ’r)then 

tmp  :="Hllllir’&a; 
else  tmp  :=  "00000000"  &  a; 
end  if; 
return  tmp; 

END  sign_ext; 

FUNCTION  toJntl6(a:stdJogic_vector(15  DOWNTO  0))  RETURN  integer  IS 
variable  val;  integer;=  0; 
variable  b:  integer:^  1 ; 

variable  tmp  :  stdJogic_vector(15  DOWNTO  0); 
variable  neg_sign,C :  std_logic; 

BEGIN 

IF(a(15)  =  ’l’)  THEN 
neg_sign 

tmp  :=  (NOT  ( a )); 

C:='r; 

FOR  i  IN  0  TO  15  LOOP 

tmp(i)  :=  tmp(i)  XOR  C; 

C  :=  tmp(i)  AND  C; 

END  LOOP; 

ELSE 

neg_sign  :=  'O'; 
tmp  ;=  a; 

END  IF; 

FOR  i  IN  0  TO  15  LOOP 

if(tmp(i)  =  'r)  then 

val  :=  val  +  b; 

end  if; 
b;=b  *2; 

END  LOOP; 


:  IN  signed(7  DOWNTO  0); 

^2.  Y2  ;  IN  signed(7  DOWNTO  0); 

elk, reset  ;  IN  std_logic; 

;  OUT  stdjogic; 

f_tmp  1  ,g_tmp  1  :  OUT  std_logic_vector(  1 5  DOWNTO  0); 

f-Out,g_out  :  OUT  INTEGER; 

:  OUT  std_logic_vector(31  DOWNTO 


IF  (neg_sign  =  '1')  THEN 


val  :=  -val; 

END  IF; 
return  val; 

END  to_intl6; 

COMPONENT  mull  1 6 

PORT(  al6  ;INsidJogic_vector(15  DOWNTOO); 

bI6  :IN  stdJogic_vector(15  DOWNTO  0); 
c32  :OUT  stdJogic_vector(3 1  DOWNTO  0))- 

END  COMPONENT; 

COMPONENT  add_subl  6 

PORT(  al6,  bl6;IN  stdJogic_vector(  15  DOWNTO 0); 
add_nsub  :IN  stdjogic; 


END  COMPONENT; 

COMPONENT  reg  16 

PORT  (  elk,  reset 


cl6 

OVR 


END  COMPONENT; 

COMPONENT  abs  16 
PORT(  al6 

END  COMPONENT; 


D 

Q 


ol6 


:IN  std_logic; 


:OUT  stdJogic_vector(15  DOWNTO  0); 
:OUT  std_logic); 


:IN  stdJogic_vector(15  DOWNTO  0); 
:OUT  stdJogic_vector(15  DOWNTO  0)); 


:IN  stdJogic_vector(15  DOWNTO  0); 

••OUT  stdJogic_vector(15  DOWNTO  0)); 


^  :INTEGER  :=  256; 

blUNAL  countin,  countout  iINTEGER  •=  0* 

f_accuml,f_accum2,  f_accum3,  f_tmp  :stdJogic_vector(15  DOWNTO  0); 

DOWNTO  OT  g_accumh2,  f_accun*2  :stdJogic_vector(15 

SIGNAL  g_accuml,g_accuni2,  g_accum3.  g_tmp,  g_accumh,  g_accuml 
:std_logic_vector(15  DOWNTO  0); 

DOWNTO  f-accumh.  Laccuml  :std_logic_vector(15 

SIGNAL  X_int,  Y_int,  R_int  TNTEGER; 

SIGNAL  f_int,  fjntl,  fjnt2,  fjnt3  rINTEGER; 

SIGNAL  g_int,  gjntl ,  g_int2,  g_int3  iINTEGER; 

SIGNAL  asf  1 ,  asg  1,  asf2.  asg2,  OVRF,  0 VRG,  OVRFt,  OVRGt  rstdjogic; 


BEGIN 

quad_funcl:  PR0CESS(X1,Y1) 

VARIABLE  tmpf,  tmpg  :  signed(7  DOWNTO  0); 
VARIABLE  f_accumlt,  g_accumlt :  signed(15  DOWNTO  0); 
VARIABLE  xin,  xin_n,  yin,  yin_n  :  signed(7  DOWNTO  0);  ' 
VARIABLE  xin_t,  yin_t :  integer; 

BEGIN 

xin  :=X1; 

xin_t  :=  CONV_INTEGER(Xl); 


xin_n  :=  C0NV_SIGNED(-xin_t,8); 

yin  :=Y1; 

yin_t  :=  CONV_INTEGER(Y  I ); 
yin_n  :=  C0NV_SIGNED(-yin_t,8); 

if  (  xin_n  <=  yin  )  AND  (  yin  <=  xin)  then 
tnipf  :=  yin; 
tmpg  ;=  xin; 

elsif  ( yin_n  <=  xin  )  AND  ( xin  <=  yin)  then 
tmpf  :=  xin; 
tmpg  :=  yin; 

elsif  (  xin_n  >=  yin  )  AND  ( yin  >=  xin)  then 
tmpf  :=  yin_n; 
tmpg  :=  xin_n; 

elsif  ( yin_n  >=  xin  )  AND  ( xin  >=  yin)  then 
tmpf  :=  xin_n; 
tmpg  ;=  yin_n; 

end  if; 


f.accumlt  :=  sign_ext(tmpO; 

Laccuml  <=  C0NV_STD_L0GIC_VECT0R(f_accumIt,16); 
g_accumlt  :=  sign_ext(tmpg); 

g_accuml  <=  CONV_STD_LOGIC_VECTOR(g_accumlt,16): 

END  PROCESS  quad_funcl; 

quad_func2:  PROCESS(X2,Y2) 

VARIABLE  tmpf,  tmpg  ;  signed(7  DOWNTO  0); 

VARIABLE  Laccumht,  g_accumht ;  signed(15  DOWNTO  0); 
VARIABLE  xin,  xin_n,  yin,  yin_n  :  signed(7  DOWNTO  0); 
VARIABLE  xin_t,  yin_t :  integer; 

BEGIN 

xin  :=X2; 

xin_t  :=  CONV_INTEGER(X2); 
xin_n  :=  CONV_SIGNED(-xin_t,8); 
yin  :=  Y2; 

yin_t  :=  CONV_INTEGER(Y2); 
yin_n  :=  CONV_SIGNED(-yin_t,8); 
if  ( xin_n  <=  yin )  AND  (  yin  <=  xin)  then 
tmpf  :=  yin; 
tmpg  :=  xin; 

elsif  ( yin_n  <=  xin )  AND  ( xin  <=  yin)  then 
tmpf  :=  xin; 
tmpg  ;=  yin; 

elsif  ( xin_n  >=  yin )  AND  ( yin  >=  xin)  then 
tmpf  :=  yin_n; 
tmpg  :=  xin_n; 

elsif  ( yin_n  >=  xin  )  AND  ( xin  >=  yin)  then 
tmpf  ;=  xin_n; 
tmpg  ;=  yin_n; 

end  if; 


Laccumht  ;=  sign_ext(tmpf); 

Laccumh  <=  CONV_STD_LOGIC_VECTOR(f_accumht,16); 
g_accumht  ;=  sign_ext(tmpg); 


END  PROCESS  quad_func2; 


•_  y  j  '-/ixtg_accumnt,  i  O); 


asfi  <=  NOT(f_accumh(  15)); 
asgl  <=N0T(g_accumh(15)); 

<=  f_accumh; 
g_lmpl  <=  g_accumh; 

absl  :  absl6  PORT  MAP(f_accumh,  f_accumh2); 
abs2  :  absl 6  PORT  MAP(g_accumh,  g_accumh2); 

addft:  add_subI6  PORT  MAP(f_accumb  Laccumh2,  asfl,  f_accumt,  OVRFt); 
addgt:  add_sub  1 6  PORT  MAP(g_accumI,  g_accumh2,  asg  U  g_accumt,  OVRGt); 

““OmpI  <=  Laccumt; 

-'g^tmpl  <=g_accumt; 

asf2  <=  NOT(f_accumt(15)); 

asg2  <=  NOT(g_accumt(15)); 

abs3  :  absl 6  PORT  MAP  (Laccumt,  f_accumt2); 

abs4  :  absl 6  PORT  MAP  (g_accumt,  g_accumt2); 

addf:  add_subl6  PORT  MAP(Laccum2,  Laccumt2,  asf2,  f^accuml,  OVRF); 
addg:  add_subl6  PORT  MAP(g_accum2,  g_accumt2,  asg2,  g_accuml,  OVRG); 

regl:  regl6  PORTMAP(cIk,  reset,  Laccuml,  Laccum2); 
reg2:  regl6  PORT  MAP(clk,  reset,  g_accuml,  g_accum2); 

f_int2  <=  to_intl6(f_accum2); 
gjnt2  <=  toJntl6(g_accum2); 

fjnt3  <=fJnt2/N; 
gjnt3  <=  g_int2  /  N; 

Lout  <=  fjnt3; 
g_out  <=  g_int3; 

f_tmp  <=  CONV_STD_LOGIC_VECTOR(f_int3,16); 
g_tmp  <=  CONV_STD_LOGIC_VECTOR(gJnt3,16); 

OVR  <=  OVRF  OR  OVRG; 

multi :  multl6  PORT  MAP  (f_tmp,g_tmp,ROUT); 


END  structural; 


5.2  Schematics  for  correlation  estimators 
5.2.1  Quadruplex  (Bit  Serial)  Correlation  Estimator 
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